similar to lipase, CoPL-RP2 [Homo sapiens],
XP_058404

XP_058404.2 GI:27482986
REFSEQ: accession XM_058404.5

Homo sapiens (human) 
Homo sapiens 

Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
MODEL REFSEQ: This record is predicted by automated computational 
analysis. This record is derived from an annotated genomic sequence 
(NT_030059) using gene prediction method: GNOMON, supported by mRNA
and EST evidence.
Also see:

Documentation of NCBI's Annotation Process

On Jan 3, 2003 this sequence version replaced gi:17472237.
Location/Qualifiers 
1..467

/organism="Homo sapiens"
/db_xref="taxon:9606"
/chromosome="10"
1..467

/product="similar to lipase, CoPL-RP2"
18..352

/region_name="Lipase"
/note="Lipase"
/db_xref="CDD:pfam00151"
355..467

/region_name="PLAT/LH2 domain. This domain is found in a 
variety of membrane or lipid associated proteins. It is 
called the PLAT (Polycystin-1, Lipoxygenase, Alpha-Toxin)
domain or LH2 (Lipoxygenase homology) domain. The known 
structure of pancreatic lipase shows this domain binds to 
procolipase pfam01114, which mediates membrane 
association. So it appears possible that this domain 
mediates membrane attachment via other protein binding 
partners. The structure of this domain is known for many 
members of the family and is composed of a beta sandwich" 
/note="PLAT" 

/db_xref="CDD:pfam01477"
355..458

/region_name="PLAT (Polycystin-1, Lipoxygenase,
Alpha-Toxin) domain or LH2 (Lipoxygenase homology) 
domain. It consists of an eight stranded beta-barrel. The 
alignment contains 2 structurally very similar subgroups, 
lipoxygenases and lipases. The domain is found at the 
N-terminus in lipoxygenases and at the C terminus in
lipases. In lipases this domain contains a disulfide
bridge. Both types of enzymes are cytosolic but need this
domain to access their sequestered membrane or micelle
bound substrates"
/note="PLAT"
/db_xref="CDD:cd00113"
CDS 1..467

/gene="LOC119548"

/coded_by="XM_058404.5:152..1555"
/db_xref="GeneID:119548"
/db_xref="LocusID:119548"

ORIGIN 

1 mlgiwivafl ffgtsrgkev cyerlgcfkd glpwtrtfst elvglpwspe kintrfllyt 
61 ihnpnayqei savnsstiqa syfgtdkitr iniagwktdg kwqrdmcnvl lqledincin
121 ldwingsrey ihavnnlrw gaevayfidv lmkkfeysps kvhlighslg ahlageagsr
181 ipglgritgl dpagpffhnt pkevrldpsd anfvdvihtn aarilfelgv gtidacghld
241 fypnggkhmp gcedlitpll kfnfnaykke masffdcnha rsyqfyaesi lnpdafiayp
301 crsytsfkag ncffcskegc ptmghfadrf hfknmktngs hyflntgsls pfarwrhkls
361 vklsgsevtq gtvflrvgga vrktgefaiv sgklepgmty tklidadvnv gnitsvqfiw
421 kkhlfedsqn klgaemvint sgkygykstf csqdimgpni lqnlkpc
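
The coding span quoted in the record (nucleotides 152 to 1555 of XM_058404.5) can be checked against the stated protein length of 467 residues. The following is a minimal sketch, assuming the span is inclusive and ends with a stop codon, as is usual for a GenBank CDS feature; the function name is hypothetical.

```python
def cds_protein_length(start: int, end: int) -> int:
    """Number of residues encoded by an inclusive nucleotide span.

    Assumes the span is a whole number of codons and that the last
    codon is a stop codon (not translated into a residue).
    """
    n_nucleotides = end - start + 1
    if n_nucleotides % 3 != 0:
        raise ValueError("span is not a whole number of codons")
    return n_nucleotides // 3 - 1  # drop the terminal stop codon


# Span taken from the coded_by qualifier of the record.
print(cds_protein_length(152, 1555))
```

Under these assumptions the span holds 1404 nucleotides, that is 468 codons, which is consistent with a 467-residue protein plus one stop codon.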
FASTA searches a protein or DNA sequence data bank
version 3.3t05 March 30, 2000
Please cite:

W.R. Pearson & D.J. Lipman PNAS (1988) 85:2444-2448

/tmp/fastaCAAdsayjO: 467 aa
>hENZ_100_ORF

vs /tmp/fastaDAAesayjO library
searching /tmp/fastaDAAesayjO library

467 residues in 1 sequences

FASTA (3.34 January 2000) function [optimized, BL50 matrix (15:-5)] ktup: 2

join: 37, opt: 25, gap-pen: -12/-2, width: 16

Scan time: 0.017
The best scores are: 

gi|27482986|ref|XP_058404.2| similar to lipase, C ( 467) 3208

>>gi|27482986|ref|XP_058404.2| similar to lipase, CoPL-R (467 aa)

initn: 3208 init1: 3208 opt: 3208
Smith-Waterman score: 3208; 99.572% identity in 467 aa overlap (1-467:1-467) 

10 20 30 40 50 60 

hENZ_1 MLGIWIVAFLFFGTSRGKEVCYERLGCFKDGLPWTRTFSTELVGLPWSPEKINTRFLLYT

gi|274 MLGIWIVAFLFFGTSRGKEVCYERLGCFKDGLPWTRTFSTELVGLPWSPEKINTRFLLYT
10 20 30 40 50 60 

70 80 90 100 110 120 

hENZ_1 IHNPNAYQEISAVNSSTIQASYFGTDKITRINIAGWKTDGKWQRDMCNVLLQLEDINCIN

gi|274 IHNPNAYQEISAVNSSTIQASYFGTDKITRINIAGWKTDGKWQRDMCNVLLQLEDINCIN
70 80 90 100 110 120 

130 140 150 160 170 180 

hENZ_1 LDWINGSREYIHAVNNLRWGAEVAYFIDVLMKKFEYSPSKVHLIGHSLGAHLAGEAGSR

gi|274 LDWINGSREYIHAVNNLRWGAEVAYFIDVLMKKFEYSPSKVHLIGHSLGAHLAGEAGSR
130 140 150 160 170 180 

190 200 210 220 230 240 

hENZ_1 IPGLGRITGLDPAGPFFHNTPKEVRLDPSDANFVDVIHTNAARILFELGVGTIDACGHLD

gi|274 IPGLGRITGLDPAGPFFHNTPKEVRLDPSDANFVDVIHTNAARILFELGVGTIDACGHLD
190 200 210 220 230 240 

250 260 270 280 290 300 

hENZ_1 FYPNGGKHMPGCEDLITPLLKFNFNAYKKEMASFFDCNHARSYQFYAESILNPDAFIAYP

gi|274 FYPNGGKHMPGCEDLITPLLKFNFNAYKKEMASFFDCNHARSYQFYAESILNPDAFIAYP
250 260 270 280 290 300 

310 320 330 340 350 360 

hENZ_1 CRSYTSFKAGNCFFCSKEGCPTMGHFADRFHFKNMKTNGSHYFLNTGSLSPFARWRHKLS

gi|274 CRSYTSFKAGNCFFCSKEGCPTMGHFADRFHFKNMKTNGSHYFLNTGSLSPFARWRHKLS
310 320 330 340 350 360 

370 380 390 400 410 420 

hENZ_1 VKLSGSEVTQGTVFLRVGGAIGKTGEFAIVSGKLEPGMTYTKLIDADVNVGNITSVQFIW

gi|274 VKLSGSEVTQGTVFLRVGGAVRKTGEFAIVSGKLEPGMTYTKLIDADVNVGNITSVQFIW

370 380 390 400 410 420 

430 440 450 460 

hENZ_1 KKHLFEDSQNKLGAEMVINTSGKYGYKSTFCSQDIMGPNILQNLKPC

gi|274 KKHLFEDSQNKLGAEMVINTSGKYGYKSTFCSQDIMGPNILQNLKPC
430 440 450 460 



467 residues in 1 query sequences 
467 residues in 1 library sequences 
Scomplib [version 3.3t05 March 30, 2000] 

start: Fri Jul 9 16:26:00 2004 done: Fri Jul 9 16:26:01 2004 
Scan time: 0.017 Display time: 0.317 

Function used was FASTA 
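
The reported 99.572% identity over 467 residues corresponds to 465 identical positions, since the query and library sequences differ only at two adjacent residues (IG in the query, VR in the library) near position 381. The sketch below shows one way such a figure can be recomputed for an ungapped, equal-length comparison; it is an illustration with a hypothetical function name, not the scoring code used by FASTA.

```python
def percent_identity(query: str, subject: str) -> float:
    """Percent identical positions between two equal-length, ungapped sequences."""
    if len(query) != len(subject):
        raise ValueError("sequences must be the same length for an ungapped comparison")
    matches = sum(q == s for q, s in zip(query.upper(), subject.upper()))
    return 100.0 * matches / len(query)


# Ten-residue window around the only mismatches in the alignment above.
window_query = "VGGAIGKTGE"
window_library = "VGGAVRKTGE"
print(percent_identity(window_query, window_library))

# Whole-alignment figure: 465 identical positions out of 467.
print(round(100.0 * 465 / 467, 3))
```

The second value rounds to the 99.572% printed in the FASTA summary line.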
Citing CD-Search: Marchler-Bauer A, Anderson JB, DeWeese-Scott C, Fedorova ND, Geer LY, He S,
Hurwitz DI, Jackson JD, Jacobs AR, Lanczycki CJ, Liebert CA, Liu C, Madej T, Marchler GH,
Mazumder R, Nikolskaya AN, Panchenko AR, Rao BS, Shoemaker BA, Simonyan V, Song JS,
Thiessen PA, Vasudevan S, Wang Y, Yamashita RA, Yin JJ, and Bryant SH (2003), "CDD: a curated
Entrez database of conserved domain alignments", Nucleic Acids Res. 31:383-387.
Structure and Activity of Rat Pancreatic Lipase-related Protein 2*

(Received for publication, July 8, 1998) 

Alain Roussel‡, Yanqing Yang§, Francine Ferrato¶, Robert Verger¶, Christian Cambillau‡,
and Mark Lowe§‖

From the §Departments of Pediatrics and of Molecular Biology and Pharmacology, Washington University School of
Medicine, St. Louis, Missouri 63110, ‡Architecture et Fonction des Macromolécules Biologiques, CNRS-IFR1,
31 Chemin Joseph Aiguier, 13402 Marseille cedex 20, France, and the ¶Laboratoire de Lipolyse Enzymatique, CNRS-IFR1
UPR 9025, 31 Chemin Joseph Aiguier, 13402 Marseille cedex 20, France



The pancreas expresses several members of the lipase 
gene family including pancreatic triglyceride lipase 
(PTL) and two homologous proteins, pancreatic lipase- 
related proteins 1 and 2 (PLRP1 and PLRP2). Despite
their similar amino acid sequences, PTL, PLRP1, and
PLRP2 differ in important kinetic properties. PLRP1
has no known activity. PTL and PLRP2 differ in sub- 
strate specificity, bile acid inhibition, colipase require- 
ment, and interfacial activation. To begin understand- 
ing the structural explanations for these functional 
differences, we solved the crystal structure of rat 
(r)PLRP2 and further characterized its kinetic proper- 
ties. The 1.8 Å structure of rPLRP2, like the tertiary
structure of human PTL, has a globular N-terminal do-
main and a β-sandwich C-terminal domain. The lid do-
main occupied the closed position, suggesting that 
rPLRP2 should show interfacial activation. When we 
reexamined this issue with tripropionin as substrate, 
rPLRP2 exhibited interfacial activation. Because the ac- 
tive site topology of rPLRP2 resembled that of human 
PTL, we predicted and demonstrated that the lipase 
inhibitors E600 and tetrahydrolipstatin inhibit rPLRP2. 
Although PTL and rPLRP2 have similar active sites, 
rPLRP2 has a broader substrate specificity that we con- 
firmed using a monolayer technique. With this assay, we 
showed for the first time that rPLRP2 prefers phos- 
phatidylglycerol and ethanolamine over phosphatidyl- 
choline. In summary, we confirmed and extended the 
observation that PLRP2 lipases have a broader sub- 
strate specificity than PTL, we demonstrated that 
PLRP2 lipases show interfacial activation, and we 
solved the first crystal structure of a PLRP2 lipase that 
contains a lid domain. 



Lipases are ubiquitous enzymes expressed by diverse organisms. 
They hydrolyze phospholipids and triglycerides to generate fatty 
acids for energy production or for storage and to release inositol 
phosphates that act as second messengers. The role of phospho- 
lipases in cellular signaling pathways has increased interest in 
these lipases. Similarly, the central role of triglyceride lipases in 
energy production and their potential industrial applications have 
stimulated studies of these essential lipases. As a result, our knowl- 



* This work was supported by National Institutes of Health Grant 
HD3306002. The costs of publication of this article were defrayed in 
part by the payment of page charges. This article must therefore be 
hereby marked "advertisement" in accordance with 18 U.S.C. Section
1734 solely to indicate this fact.

The atomic coordinates and structure factors (code 1bu8) have been
deposited in the Protein Data Bank, Brookhaven National Laboratory,
Upton, NY.

‖ To whom correspondence should be addressed.
This paper is available on line at http://www.jbc.org 



edge about lipases and of the molecular details underlying lipolysis
has increased considerably. 

Among these contributions was the cloning of cDNAs encod- 
ing various lipases. Comparisons of the amino acid sequences 
predicted from these cDNAs led to the hypothesis that a lipase 
gene family evolved from a common ancestral hydrolase (1). At 
least three members of the lipase gene family are synthesized 
and secreted by the pancreas. One, the archetype of the family, 
colipase-dependent pancreatic triglyceride lipase (PTL) has 
been studied for over 100 years. Giller et al. (2) isolated human
pancreatic cDNAs encoding the other two, named pancreatic
lipase-related proteins 1 and 2 (PLRP1 and PLRP2), 6 years
ago. The primary sequences of the human PLRP1 and PLRP2
have 68 and 65% identity to the primary sequence of PTL with
conservation of the catalytic triad and major determinants of 
tertiary structure. Subsequently, other groups reported the 
presence of related proteins in the pancreas from several spe- 
cies (3-7). 

The best studied of the PTL homologues is PLRP2. These 
studies provided the first indications that PLRP2 has func- 
tional properties different from those of PTL. Mouse PLRP2 
was cloned from interleukin-4-stimulated cytotoxic T-lympho-
cytes, and rat (r)PLRP2 was cloned as a zymogen granule
membrane protein, GP3 (4, 7). The presence of PLRP2 in lym-
phocytes and on the zymogen granule membrane raised the
possibility that PLRP2 has functions other than hydrolyzing
dietary fats. For instance, lymphocyte PLRP2 may participate
in cell killing, and the PLRP2 on the zymogen granule mem- 
brane may mediate granule fusion with the plasma membrane. 

The expression and purification of PLRP2 lipases allowed 
the enzymatic properties of these enzymes to be characterized. 
These studies revealed that PLRP2 lipases have enzymatic 
properties that distinguish them from PTL (4, 5, 7-9). First, 
PLRP2 has a broader substrate specificity and will hydrolyze 
triglycerides, phospholipids, and galactolipids. PTL hydrolyzes
only triglycerides. Second, they have different behaviors in the 
presence of bile salts and with colipase. Third, PLRP2 members 
efficiently hydrolyze monomers of water-soluble, short chain 
triglycerides, whereas PTL possesses low activity against mo- 
nomeric substrates. PTL activity increases dramatically 
against water-insoluble substrates presenting an oil-water in- 
terface, a property known as interfacial activation. Clearly, the 
explanation for these kinetic and functional differences must 
lie in the structure of these proteins. 

The first pancreatic lipase structure to be solved was that of 
the human enzyme (hPTL) (10). Subsequently, the complex of 
hPTL with porcine colipase was elucidated, as were the struc- 



¹ The abbreviations used are: PTL, pancreatic triglyceride lipase; Col,
colipase; c, coypu; g, guinea pig; h, human; PLRP1, pancreatic lipase-
related protein 1; PLRP2, pancreatic lipase-related protein 2; p, por-
cine; r, rat; MES, 4-morpholineethanesulfonic acid.
tures of hPTL-porcine colipase in complex with phospholipid or
phosphonate inhibitors (10-13). The structures of horse PTL,
hPTL-human colipase complex, and porcine PTL-porcine co-
lipase complex have also been determined (14-16). The struc-
tures of hPTL-porcine colipase crystallized in the presence of
mixed phospholipid/bile salt micelles or of a C11 phosphonate
inhibitor revealed that the lid domain (residues 237-261) cov-
ering the active site of hPTL could move away from its closed
position (12, 13). The movement creates new contacts with
colipase to form the lipid-water interfacial binding site. An-
other structural element, the β5 loop, undergoes a spatial re-
organization and folds back on the core of the protein. These
drastic conformational changes, leading to the open conforma-
tion, give substrate free access to the catalytic triad.

Comparison of PLRP2 and PTL family members requires
high resolution crystal structures, but no structure of a true
PLRP2 is available. Withers-Martinez et al. (38) reported the
crystal structure of a chimeric lipase with the C-terminal do-
main of human PTL and the N-terminal domain of guinea pig 
PLRP2 (gPLRP2). Although inferences about the conformation 
of PLRP2 can be made from the known PTL and gPLRP2 
structure, valid conclusions must be based on the actual PLRP2 
structure. In this paper, we report the first crystal structure of 
a PLRP2 family member that has a lid domain and further 
characterize the enzymatic properties of this lipase. 

MATERIALS AND METHODS 

Expression of Rat PLRP2 in Sf9 Cells— We expressed recombinant
rat PLRP2 in baculovirus-infected Sf9 cells as described previously with
the following modifications (3, 8). The protein was produced in 1-liter
spinner flasks containing 350 ml of serum-free medium, EX-CELL 400
(JRH Biosciences, Lenexa, KS), instead of 1 liter of medium. We har-
vested the medium 3 days instead of 4 days post-infection. The smaller
medium volume and shorter culture times gave higher protein yields as
initially observed by Bezzine et al. (17).

Purification of Rat PLRP2—We removed cells and debris by centri-
fuging the medium at 5000 rpm for 10 min in a Beckman J2-21
centrifuge with a JA-20 rotor. The medium was concentrated over a Pall
Filtron 10K Ultrasette membrane (Filtron Technology Corporation,
Northborough, MA) to about 50 ml and dialyzed by repeated dilution
with 10 mM Tris-HCl, pH 8.0, and concentrating over the Ultrasette
membrane. We applied the sample to a 75-ml bed volume DEAE-Blue-
Sepharose (Bio-Rad) equilibrated in the Tris buffer. PLRP2 was eluted
with a linear NaCl gradient from 0.0 to 0.6 M. Assay with tributyrin in
a pH-stat located the PLRP2. We pooled the peak fractions and dialyzed
against 10 mM MES, pH 6.2, buffer followed by concentration over an
Amicon YM30 membrane (Amicon, Inc., Beverly, MA). The sample was
applied to a Pharmacia Mono-S column (5-ml bed volume) (Amersham

Table I

Data collection and final refinement statistics

Data collection
  Resolution limit (Å)                                   10.0 - 1.8
  Data completion all/last shell (%) (I/σ(I) > 1)        97.5/81.9
  Redundancy                                             2.4
  R_sym (%) (a)                                          6.3
Refinement
  Resolution limit (Å)                                   10.0 - 1.8
  Number of reflections                                  36060
  Number of protein atoms                                3799
  Number of water molecules                              295 (+1 GlcNAc, 7 EG, 1 Ca)
  Final R-factor/R-free (%) (b)                          20.3/24.2
  B-factors (Å²)
    Main chains/side chains                              19.7/22.0
    Water                                                30.6
    Ca/EG/GlcNAc                                         9.4/26.9/41.7
  Root mean square deviations from ideal values
    Bonds (Å)                                            0.007
    Angles (°)                                           1.4
    Improper/dihedral angles (°)                         1.3/25.5

(a) $R_{sym} = \sum_h \sum_i |I(h)_i - \langle I(h) \rangle| / \sum_h \sum_i I(h)_i$, where $I(h)_i$ is the observed
intensity of the ith measurement of reflection h and $\langle I(h) \rangle$ the mean
intensity of reflection h.

(b) $R = \sum ||F_o| - |F_c|| / \sum |F_o|$; $F_o$ and $F_c$ are the observed and calculated
structure factor amplitudes, respectively.
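
The two agreement statistics named in the Table I footnotes can be sketched numerically. The code below is an illustration that assumes the conventional definitions (R as the summed absolute difference between observed and calculated amplitudes divided by the summed observed amplitudes, and R_sym as the summed deviation of repeated intensity measurements from their per-reflection mean divided by the summed intensities). The function names and the toy numbers are hypothetical and are not data from the paper.

```python
from typing import Dict, List, Sequence


def r_factor(f_obs: Sequence[float], f_calc: Sequence[float]) -> float:
    """Crystallographic R: sum(| |Fo| - |Fc| |) / sum(|Fo|)."""
    if len(f_obs) != len(f_calc):
        raise ValueError("observed and calculated lists must have equal length")
    numerator = sum(abs(abs(fo) - abs(fc)) for fo, fc in zip(f_obs, f_calc))
    denominator = sum(abs(fo) for fo in f_obs)
    return numerator / denominator


def r_sym(measurements: Dict[str, List[float]]) -> float:
    """R_sym over repeated intensity measurements grouped by reflection."""
    numerator = 0.0
    denominator = 0.0
    for intensities in measurements.values():
        mean_intensity = sum(intensities) / len(intensities)
        numerator += sum(abs(i - mean_intensity) for i in intensities)
        denominator += sum(intensities)
    return numerator / denominator


# Toy values for illustration only.
print(r_factor([100.0, 50.0, 25.0], [90.0, 55.0, 30.0]))
print(r_sym({"h1": [100.0, 110.0], "h2": [50.0, 50.0]}))
```

R-free uses the same expression as R, evaluated only on the subset of reflections withheld from refinement.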
Fig. 1. Stereo view of the sigma weighted electron density map of the closed lid contoured at 1 σ.




Fig. 2. Structure of rPLRP2. A, stereo view of the Cα trace of rPLRP2. The side chains of the N and C terminus residues as well as those of
the catalytic Ser' "^'^ and of the lid residue Trp'**'* are represented. The ethylene glycol molecules (blue) and the GlcNAc residue linked to Asn are
shown. Green, C terminus domain; brown, N-terminal catalytic domain; yellow, catalytic domain following the lid; pink, lid. B, stereo view of the
Cα trace of rPLRP2 superimposed on those of the other known closed pancreatic lipase structures.



Pharmacia Biotech) attached to an Akta Purifier (Amersham Pharma- 
cia Biotech). The column was equilibrated in 50 mM MES, pH 6.2, and 
eluted with a linear NaCl gradient from 0.0 to 1.0 M. rPLRP2 eluted 
from the column as a symmetrical peak identified by activity against 



tributyrin. The peak fractions were pooled, and the pH was adjusted to 
8.0 with 1 M Tris-Cl, pH 8.0. The purified protein migrated as a single 
band on 10% SDS-polyacrylamide gel electrophoresis and had activity 
against tributyrin, trioctanoin, triolein, and phosphatidylcholine as de-
Fig. 3. Interfacial activation of rPLRP2. A, interfacial activation
demonstrated in the laboratory of R. Verger. The tripropionin solutions
were made in 1% gum arabic as described under "Materials and Meth-
ods." 25 µg of rPLRP2 was added with a 5-fold molar excess of pure
porcine colipase. B, interfacial activation demonstrated in the labora-
tory of M. Lowe. The tripropionin solutions were prepared in 2% gum
arabic as described under "Materials and Methods." 15 µg of rPLRP2
and a 5-fold molar excess of pure human colipase were added. C, inter-
facial activation with p-nitrophenylbutyrate as the substrate. The assay
included 20 µg of rPLRP2 and a 5-fold molar excess of pure human
colipase. The vertical dashed line in each figure shows the concentration
at which saturation of the tripropionin or p-nitrophenylbutyrate solutions
occurs.

scribed previously (8). rPLRP2 concentrations were determined by the 
BCA protein assay using purified bovine serum albumin as the 
standard. 

Lipase Assays— 1,2-rac-Didecanoyl glycerol (dicaprin) was purchased
from Sigma. 1,2-sn-Didodecanoyl phosphatidylcholine, 1,2-sn-didode-
canoyl phosphatidylethanolamine, and 1,2-sn-didodecanoyl phosphati-
dylglycerol were purchased from Fluka (Paris, France). 3-Monogalacto-
syl-1,2-rac-didodecanoyl glycerol was prepared by chemical synthesis
and was a generous gift from Professor G. C. Ortaggi (Roma). A rPLRP2
solution of 0.5 mg/ml was used for kinetic experiments using the mono-
layer technique as well as for the interfacial activation experiments,
using tripropionin as substrate. Bulk phase assays were done by the
pH-stat method as described (5, 8, 18). The conditions for inhibitor
assays are given in the figure legends. Tetrahydrolipstatin was kindly
provided by Dr. Hans Lengsfeld from Hoffmann-LaRoche.

Kinetic Experiments on Monolayers — Before each utilization, the Tef- 
lon trough used to form the monomolecular film was cleaned with 
water, then gently brushed in the presence of distilled ethanol, washed 
Fig. 4. Inhibition of rPLRP2 by tetrahydrolipstatin and E600.

The activity of rPLRP2 against tributyrin was measured after incubat- 
ing rPLRP2 with either E600 or tetrahydrolipstatin. O, 100-fold molar 
excess of tetrahydrolipstatin in the presence of 4 mM sodium taurode- 
oxycholate in 10% isopropanol. V, 100-fold molar excess of E600 in 5% 
isopropanol without bile salts. Another aliquot of E600 equivalent to a 
100-fold molar excess was again added after the 1-h time point was 
sampled. 

again with tap water, and finally rinsed with double-distilled water (19, 
20). The aqueous subphase was composed of 10 mM Tris-HCl, pH 8.0, 
100 mM NaCl, 21 mM CaCl2, and 1 mM EDTA for all lipases. The buffer
was prepared with double-distilled water and filtered through a
0.45-µm Millipore filter. Any residual surface-active impurities were
removed before each assay by sweeping and suction of the surface.
Kinetic experiments were performed with a KSV-2200 barostat (KSV-
Helsinki) and a "zero-order" Teflon trough (20). The trough was
equipped with a mobile Teflon barrier, which was used to compensate
for the substrate molecules removed from the film by enzyme hydrolysis
(monodecanoyl glycerol, decanoic acid, lyso-dodecanoyl phospholipids,
and dodecanoic acid are soluble in water), thereby keeping the surface
pressure constant. The latter was measured using a Wilhelmy plate
(perimeter 3.94 cm) attached to an electromicrobalance, which was
connected in turn to a microprocessor controlling the movement of the
mobile barrier. The reactions were performed at room temperature
(20 °C). The subphase of the reaction compartment was continuously
agitated with a 2.0-cm magnetic stirrer moving at 250 rpm. The
rPLRP2 solution (10-60 µl at 0.5 mg/ml) was injected through the film
over the stirrer with a Hamilton syringe. The surface area of the
reaction compartment was 31 cm², and the volume was 55 ml. The
length of the reservoir compartment was 30 cm, and the width was 17.6 
cm. 

Interfacial Activation of rPLRP2— The tripropionin (TC3) solutions were systemat-
ically prepared by mixing three times 30 s in a Waring blender, a given
amount of TC3 in 15 ml of 1% gum arabic in water (w/w) (21). Before
each assay, 5 ml of the TC3 gum arabic solution was added to 10 ml of
pure water in the thermostated (37 °C) pH-stat vessel. Deionized water,
purified with a Millipore Super Q system, was used throughout all the
experiments. Lipase activity was recorded with either a TTT 80 pH-stat
(Radiometer) equipped with a 250-µl syringe containing 0.1 N NaOH or
a VIT 90 pH-stat (Radiometer) equipped with an automatic burette 
containing 0.05 M NaOH. Activity was measured potentiometrically at 
pH 7.0, because at pH 8.0 the spontaneous hydrolysis of TC3 reaches 
relatively high levels. The assay was carried out on a mechanically 
stirred solution of substrate in the reaction vessel. Spontaneous hydrol- 
ysis was recorded in the pH-stat mode for 2 min before lipase injection, 
and this background value was subtracted from the activity measure- 
ment. One international lipase unit is the amount of enzyme catalyzing 
the release of 1 µmol fatty acid/min. Each assay contained a 5-fold
molar excess of pure colipase. We checked that bovine serum albumin 
(final concentration, 1%) had no detectable catalytic activity on a TC3 
solution (7.7 mM) or on a TC3 emulsion (15.33 mM). Interfacial activa- 
tion assays with p-nitrophenylbutyrate were done as described (22). 
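
The unit definition above (one unit releases 1 µmol of fatty acid per minute, after subtracting spontaneous hydrolysis) can be turned into a short calculation. The sketch below assumes that each mole of NaOH delivered by the pH-stat neutralizes one mole of released fatty acid; the function names and the example readings are hypothetical and are not measurements from the paper.

```python
def lipase_units(titrant_ul_per_min: float,
                 naoh_molarity: float,
                 background_ul_per_min: float = 0.0) -> float:
    """Activity in units (µmol fatty acid per min) from a pH-stat titration rate.

    Microlitres per minute multiplied by mol/l gives µmol per minute.
    The background rate is the spontaneous hydrolysis recorded before
    the enzyme is injected. Assumes 1:1 NaOH to fatty acid stoichiometry.
    """
    return (titrant_ul_per_min - background_ul_per_min) * naoh_molarity


def specific_activity(units: float, enzyme_ug: float) -> float:
    """Specific activity in units per mg of enzyme."""
    return units / (enzyme_ug / 1000.0)


# Hypothetical reading: 50 µl/min of 0.1 N NaOH, 5 µl/min background, 25 µg enzyme.
units = lipase_units(50.0, 0.1, background_ul_per_min=5.0)
print(units)
print(specific_activity(units, 25.0))
```

With these illustrative readings the assay corresponds to about 4.5 units, or about 180 units/mg.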

Crystallization, X-ray Data Collection, and Processing— Small crys-
tals of rPLRP2 were obtained at room temperature using the hanging
drop vapor diffusion method, by mixing 2 µl of protein (18 mg/ml in 0.2
M NaCl, 0.1 M Tris-HCl, pH 8.5) and 2 µl of the Hampton screen 1
solution 36 (Hampton Research, Laguna Hills, CA) containing 8% pol-
yethylene glycol 8000 and 0.1 M Tris-HCl, pH 8.4. Crystals were im-
proved by diminishing the polyethylene glycol and protein concentra- 
Fig. 5. Monolayer activity of PLRP2
lipases and PTL against various sub-
strates. Activities were determined in a
monolayer trough as described under
"Materials and Methods." The substrate
legend is given in the figure. n-gPLRP2,
guinea pig PLRP2 isolated from pancreas;
gPLRP2, recombinant guinea pig PLRP2;
hPTL, human PTL isolated from pancre-
as; cPLRP2, recombinant coypu PLRP2;
rPLRP2, recombinant rat PLRP2.
Activities of lipases against various triglycerides in bulk phase assays

All assays done in pH-stat with excess colipase.










Specific activity 




Lipase 


Tripropionin" 


Tributyrin Triolein 


Dicaprin^ 


hPTL 
rPLRP2 
gPLRP2 
cPLRP2


7000 
224 

1000 
850 


units/mg

8000* 1600* 
3900 1200 
1700^ 
2000^ 


170 
130 



° Measured in gum arabic with no taurodeoxycholate. 
* Measured in 4 mM taurodeoxycholate. 
" Taken from Ref. 39. 
Taken from Ref. 5. 

tions by a factor of two. Larger crystals were obtained by macroseeding
using a 4+4 µl mixture and 2% polyethylene glycol 8000.

The crystals were soaked in a synthetic mother liquor containing 
33% ethylene glycol as cryoprotectant and were subsequently cryo- 
cooled using the Oxford equipment (Oxford Cryosystems, Oxford, UK).
X-ray diffraction data were collected to 1.80 Å resolution on a 30-cm
Mar-research imaging plate at 0.970 Å wavelength on beamline DW32
in LURE (Orsay, France). The data were processed using the DENZO
software package (23). The rPLRP2 crystallizes in the space group
with cell dimensions a = 57.4 Å, b = 79.1 Å, c = 60.9 Å, and β = 102.1°.
Specific volume calculations yielded one molecule/asymmetric unit,
with a V_m of 2.7 Å³/Da and a solvent content of approximately 54% (24).
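
The quoted solvent content can be related to the specific volume with the usual Matthews approximation, in which the protein fraction is about 1.23/V_m. The sketch below also computes the unit cell volume from the quoted cell dimensions, taking the single quoted angle as the β angle of a monoclinic cell; that reading, the constant 1.23 Å³/Da, and the function names are assumptions made for illustration, since the space group symbol is not legible in this copy.

```python
import math


def monoclinic_cell_volume(a: float, b: float, c: float, beta_deg: float) -> float:
    """Unit cell volume in cubic angstroms for a monoclinic cell (alpha = gamma = 90 degrees)."""
    return a * b * c * math.sin(math.radians(beta_deg))


def solvent_fraction(v_m: float, protein_term: float = 1.23) -> float:
    """Approximate solvent fraction from the Matthews coefficient V_m (cubic angstroms per Da)."""
    return 1.0 - protein_term / v_m


# Cell dimensions and V_m as quoted in the text.
print(round(monoclinic_cell_volume(57.4, 79.1, 60.9, 102.1)))
print(round(solvent_fraction(2.7), 2))
```

For V_m = 2.7 Å³/Da this approximation gives a solvent fraction of about 0.54, in line with the approximately 54% stated above.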






A total of 36,060 unique reflections were indexed using the
SCALEPACK program with an R-factor on intensities of 6.3%, a data
set multiplicity of 2.4 and a completeness of 97.5%, between 10.0 and
1.8 Å (Table I) (25).

Structure Determination — The structure was solved with the molec-
ular replacement method using the AMoRe program (25). The closed
form of the classical human pancreatic lipase was used as the search
model. The rotation function of the two bodies, performed with the
N-terminal domain without the lid and the C-terminal domain, yielded
only one significant solution for the entire molecule (correlation coeffi-
cient of 0.50 and R-factor of 0.39 between 10.0 and 3.5 Å resolution). The
structure was refined using the X-PLOR program (26). After performing
12 cycles of slow cooling protocol starting at 2000 K and manual re-
placement and adjustments using the Turbo-Frodo program, the R-
factor had decreased from 45 to 24.6% (R-free 31.8%) (27). The water
molecules located in the (2Fo - Fc) and (Fo - Fc) maps numbered 295,
and one GlcNAc sugar bound to Asn^^"* was identified. Seven molecules
of ethylene glycol were modeled in the density. The R-factor calculated
with 36,060 reflections between 10.0 and 1.80 Å resolution was 20.3%,
and the final R-free factor calculated with 5% of the reflections was equal to
24.2% (Table I). The Ramachandran plot and the electron density map
(see Fig. 1) further demonstrate the quality of the model. Coordinates
have been deposited in the Protein Data Bank with the accession
number 1bu8.

RESULTS AND DISCUSSION 

Overall Structure of rPLRP2— The structure of rPLRP2 has 
been refined at the highest resolution observed among the 
pancreatic lipases. The current model consists of 3505 protein 
atoms, one GlcNAc connected to Asn^^'*, one calcium ion, seven 
molecules of ethylene glycol, and 295 water molecules. One 
protein segment, located between residues 405 and 411, was 
found to have no interpretable electron density and was there- 
fore removed from the model. In the Ramachandran plot (Pro- 
check software) of the final model, all of the main chain dihe- 
dral angles but one (Ser^^^) fell within the allowed regions 
(90.5% in the most favorable regions, and 9.3% in additional 
allowed regions) (28). The active site Ser^®^ has the ε confor-
mation found in other lipases and in the α/β hydrolase fold
family and is therefore located in a "generously allowed region" 
(28, 29). The glycosylated Asn^^"* is conserved in all PLRP2 
lipases except the coypu. This glycosylation site was not pres- 
ent in the structure of the gPLRP2/hPTL chimera because the 
C-terminal domain originated from hPTL. The PLRP1 glycosy-
lation site at position 138 is not present in PTL or in PLRP2
lipases (30). Another potential glycosylation site in classical
lipases is located at position 166 but is only partially conserved
(14) and is not present in PLRP1 or PLRP2 lipases. Due to
cryocooling, high resolution, and low B-factors, seven molecules 
of cryoprotectant can be observed in the electron density map. 
The ethylene glycol molecules are stabilized mainly by hydro-
gen bonds and in part by hydrophobic interactions. Hydrogen
bonds are established with Arg side chains (four cases) and 
with main and side chains of various semi-polar and polar 
residues. Hydrophobic interactions involve aromatics (four 
cases) and aliphatic residues (two cases). 

The rPLRP2 structure belongs to the α/β hydrolase fold
family of proteins (29). The protein consists of two main do-
mains, a globular N-terminal domain and a β-sandwich C-
terminal domain. This structure closely resembles the struc-
ture of hPTL and the other members of the pancreatic lipase
fold family (see Fig. 2A) (10). The core of the N-terminal domain
of the molecule consists of a tightly packed β-sheet surrounded
by five helices. The rPLRP2 lid, which adopts a closed confor-
mation, has an electron density of excellent quality and does
not show any sign of disorder or particular flexibility (Fig. 1).
Its B-factors display the same pattern as in other closed pan-
creatic lipases (data not shown). The lid is located between the
two sides of the bridged cysteine residues 237-261. Together
with the β5 loop, it adopts the same conformation as that



observed in the closed structure of hPTL (10). The active site is 
located at the bottom of a hydrophobic crevice that is covered by 
the lid. The catalytic triad (Ser^^^, His^*'^, and Asp^^^) includes 
the nucleophile belonging to the usual consensus sequence 
G-X-(nucleophile)-X-G (Fig. 2A).

The structure of rPLRP2 has been superimposed on all the 
known pancreatic lipase structures: hPTL, hPTL bound to por-
cine colipase (hPTL-pCol), hPTL-pCol in complex with a C11
phosphonate inhibitor (hPTL-pCol-C11), horse pancreatic
lipase, gPLRP2, dog pancreatic PLRP1, and porcine pancreatic
lipase bound to porcine colipase (pPTL-pCol). The N-terminal
catalytic domains of the various molecules superimpose well
(within 1.0 Å), apart from the mobile loops (the lid and the β5
loop), which switch between the closed and open conformations.
The degree of structural homology of rPLRP2 compared with
the other lipases found in the closed conformation (dog PLRP1,
hPTL, and horse PTL) is remarkable (Fig. 2B). As described 
previously, the C-terminal domain orientations of the struc- 
tures included in the comparisons were found to differ due to 
rotations of a few degrees occurring around hinge residue 337 
(Fig. 2B) (11-14). 

Active Site Structure of rPLRP2— We have proposed a model
for the putative binding of a triglyceride at the hPTL-pCol-C11
active site crevice, based on the hPTL-pCol-C11 complex (13).
One acyl chain of the triglyceride, the leaving fatty acid, was
assumed to bind at the position of the C11 conformer 1,
whereas a second acyl chain was taken to bind at the position
of the C11 conformer 2 (Fig. 3A). We superimposed and com-
pared the closed structure of rPLRP2 with the open structure of
hPTL-pCol-C11, to investigate, on the basis of our model,
whether the binding of a triglyceride to rPLRP2 was compati-
ble with the active site structure. To make these comparisons
valid, a model of the open rPLRP2 has been built. The lid and
the β5 loop were taken from the open hPTL structure and
grafted onto the core of the PLRP2 enzyme, and the lid residues
were substituted according to the rPLRP2 sequence. No resi-
dues were found to be substituted between hPTL and rPLRP2
in a 10 Å radius around the nucleophilic Ser^'^^ Oγ. Conse-
quently, the C11 phosphonate inhibitor positioned into the
open rPLRP2 model exhibits the same protein contacts as in
the classical lipase.

hPTL has two colipase binding sites. One is located at the C-terminal
domain and was found to bind colipase when the catalytic domain is in
the closed or in the open conformation (11). A second site appears only
when lipase opens, yielding an interaction of colipase with the open
lid (12). We have investigated the likelihood of both sites with a
model of open PLRP2. The three lid domain residues involved in the
interaction with colipase are conserved. Among the 12 C-terminal
residues interacting with colipase, only four are substituted compared
with hPTL: Phe -> Tyr, Ile401 -> Leu, Tyr -> Asn, and Glu441 -> Asp.
The I401L substitution does not alter the interaction. The E441D
substitution may abolish the interaction with colipase Arg, but very
limited changes, such as side chain torsion, could restore the ion
pair. The stacking interaction of Tyr with Arg is also lost but is
readily replaced by a hydrogen bond between Asn and Arg. The fourth
substituted residue, Tyr, clashes severely with colipase Glu. This
unfavorable interaction of the tyrosine hydroxyl group can be easily
turned to a favorable hydrogen bond through side chain torsion of these
two surface residues. To summarize, most interactions between colipase
and PTL would be conserved in rPLRP2, the two unfavorable interactions
resulting from substitutions can be relaxed easily, and no new
unfavorable interactions appear. This conclusion is consistent with
kinetic data showing that rPLRP2 does interact with colipase.
Finally, the biantennary saccharide is located at position
334, between the N- and the C-terminal domains, on the en- 
zyme face opposite to the catalytic center. Despite this location, 
it should not interfere with lid opening or with colipase 
binding. 

Interfacial Activation. Because it was previously reported that rPLRP2
and cPLRP2 displayed no interfacial activation, we expected structural
differences between hPTL and rPLRP2, particularly in the lid domain (5,
8). The results contradicted our expectations because the lid domain
was neither disordered nor in an open conformation. In fact, the closed
position of the lid in rPLRP2 closely resembled the position of the lid
domain in PTL (10). The good electron density and closed conformation
suggested that rPLRP2 should show interfacial activation.

Because of our findings, we reexamined rPLRP2 for interfacial
activation using the recently validated method with tripropionin (21,
32). The use of tripropionin overcomes the difficulties that accompany
the poor water solubility of tributyrin, the substrate utilized in
previous studies of PLRP2 and interfacial activation (5, 6, 8). The
experiment was replicated independently in two different laboratories
under slightly different conditions (Fig. 3, A and B). At
concentrations below the solubility limit of tripropionin, rPLRP2 had
little activity. The activity increased considerably above the
solubility limit of the substrate. Similar kinetics were found with
another substrate, p-nitrophenylbutyrate, that has been used to
demonstrate interfacial activation for other lipases (Fig. 3C). These
results clearly show that rPLRP2 displays interfacial activation,
preferring aggregated substrates over monomeric substrates, as does
PTL. The observation of interfacial activation on tripropionin restores
the validity of the classical explanation: a closed lid means
interfacial activation. Although open lids or disordered lid structures
have been observed in other lipases, even in the absence of inhibitor,
and appear to violate this principle, these structures were obtained in
the presence of less polar solvent or in detergent, which may simulate
an interface (33-36).

Inhibition of rPLRP2 by E600 and Tetrahydrolipstatin. The conformation
of the rPLRP2 active site and the conserved Ser-His-Asp catalytic triad
suggested that it should be inhibited by lipase inhibitors like E600
(diethyl p-nitrophenyl phosphate) and tetrahydrolipstatin (37). To
determine whether these compounds inhibit rPLRP2, we incubated the
lipase with both inhibitors for various lengths of time and measured
activity against tributyrin (Fig. 4). Both inhibitors effectively
reduced the activity of rPLRP2. The inhibition suggests that the
catalytic mechanism of rPLRP2 is similar to that of PTL. If PLRP2
participates in fat digestion, its activity should be effectively
decreased by tetrahydrolipstatin. Finally, these results suggest that
obtaining the structure of rPLRP2 in the open form may be possible in
the presence of an inhibitor, as previously done with PTL.

Substrate Specificity— Bulk phase assays had previously 
demonstrated that PLRP2 lipases have a broader substrate 
specificity than does PTL (5, 6, 8). We extended these observa- 
tions by measuring activity against various substrates using 
the monolayer technique (Fig. 5). We compared the activities 
against three different phospholipase substrates,
1,2-didodecanoyl-phosphatidylcholine,
1,2-didodecanoylphosphatidylethanolamine, and
1,2-didodecanoylphosphatidylglycerol; one lipase substrate,
1,2-dicaprin; and one galactolipase substrate,
monogalactosyldiglyceride. This is the first use of a galactolipid
substrate in the monolayer assay. Furthermore, we compared the activity
of rPLRP2 to those of gPLRP2, cPLRP2, and hPTL. All four lipases had
activity against 1,2-dicaprin. hPTL was not active against the
phospholipid or galactolipid substrates. In contrast, all three of the
PLRP2 lipases showed activity against phospholipids as previously
reported. Like cPLRP2, rPLRP2 shows a clear preference for
1,2-didodecanoylphosphatidylethanolamine and
1,2-didodecanoylphosphatidylglycerol over
1,2-didodecanoyl-phosphatidylcholine (5). The activity of rPLRP2 and
cPLRP2 against 1,2-didodecanoyl-phosphatidylcholine was quite low
compared with the other two phospholipid substrates. Both rPLRP2 and
gPLRP2 but not hPTL showed activity against monogalactosyldiglyceride.
In addition to confirming the activity of PLRP2 lipases against
galactolipids, this result demonstrates the utility of the monolayer
assay for measuring galactolipase activity.

Although rPLRP2 activity against these various substrates could be
easily measured in the monolayer assay, rPLRP2 had lower activity than
did the other lipases. This finding is consistent with the results of
the bulk phase assay with galactolipids, where rPLRP2 had decreased
activity compared with gPLRP2 (9). The lower activity of rPLRP2 against
1,2-dicaprin was surprising. In bulk phase assays, the specific
activity of rPLRP2 compares favorably with that of hPTL (Table II). The
explanation for this difference was not examined, but the finding may
indicate that rPLRP2 is more sensitive to denaturation by the monolayer
than are the other lipases or that rPLRP2 may partition itself less
favorably in the monolayer system than other lipases. Direct
comparisons of bulk phase phospholipase activity of rPLRP2 and the
other PLRP2 lipases have not been done. The monolayer data indicate
that rPLRP2 has lower activity against phospholipids than do other
members of the PLRP2 family.

The current rPLRP2 structure does not explain the different activities
of rPLRP2 and hPTL against phospholipids and galactolipids. There were
no differences in the residues or positions of the residues around the
active sites of rPLRP2 and hPTL to explain the substrate preferences.
We observed differences in the lid domain and in the β5 loop between
the two enzymes, but they do not obviously explain the substrate
differences when compared with the open, active form of hPTL. It will
be necessary to solve the open structure of rPLRP2 before differences
in the active sites become apparent. Possibly, residues away from the
active site will affect substrate specificity, as found in the serine
proteases.

Concluding Remarks— In this paper, we report the first 
structure for a member of the PLRP2 family and demonstrate 
that rat PLRP2 does show interfacial activation. Additionally, 
we confirm and extend the observations that PLRP2 lipases 
possess broader substrate specificities than do the closely homologous
pancreatic triglyceride lipases. These studies represent the beginning
of investigations that will contribute to understanding the molecular
mechanisms underlying lipolysis.
THE HUMAN GENOME

A 2.91-billion base pair (bp) consensus sequence of the euchromatic
portion of the human genome was generated by the whole-genome shotgun
sequencing method. The 14.8-billion bp DNA sequence was generated over
9 months from 27,271,853 high-quality sequence reads from both ends of
plasmid clones made from the DNA of five individuals. Two assembly
strategies, a whole-genome assembly and a regional chromosome assembly,
were used, each combining sequence data from Celera and the publicly
funded genome effort. The public data were shredded into 550-bp
segments to create a 2.9-fold coverage of those genome regions that had
been sequenced, without including biases inherent in the cloning and
assembly procedure used by the publicly funded group. This brought the
effective coverage in the assemblies to eightfold, reducing the number
and size of gaps in the final assembly over what would be obtained with
5.11-fold coverage. The two assembly strategies yielded very similar
results that largely agree with independent mapping data. The
assemblies effectively cover the euchromatic regions of the human
chromosomes. More than 90% of the genome is in scaffold assemblies of
100,000 bp or more, and 25% of the genome is in scaffolds of 10 million
bp or larger. Analysis of the genome sequence revealed 26,588
protein-encoding transcripts for which there was strong corroborating
evidence and an additional ~12,000 computationally derived genes with
mouse matches or other weak supporting evidence. Although gene-dense
clusters are obvious, almost half the genes are dispersed in low G+C
sequence separated by large tracts of apparently noncoding sequence.
Only 1.1% of the genome is spanned by exons, whereas 24% is in introns,
with 75% of the genome being intergenic DNA. Duplications of segmental
blocks, ranging in size up to chromosomal lengths, [...] and reveal a
complex [...] sequence comparisons between the consensus sequence [...]
remains an open challenge.



Decoding of the DNA that constitutes the human genome has been widely
anticipated for the contribution it will make toward [...] the
environment and [...] human condition. A project with the goal of
determining the complete nucleotide sequence of the human genome was
first [...]. In subsequent years, the idea met with mixed reactions in
the scientific community (2). However, in 1990, the Human Genome
Project was officially initiated in the United States under [...] the
U.S. Department of Energy with a [...] plan for completing the genome
sequence. [...] we announced our intention [...] sequencing facility,
to determine the sequence of the human genome over a 3-year period.
Here we report the penultimate milestone along the path toward that
goal, a nearly complete sequence of the euchromatic portion of the
human genome. The sequencing was performed by a whole-genome random
shotgun method with subsequent assembly [...].

[...] DNA using chain-terminating nucleotide analogs (3). In the same
year, the first human gene was isolated and sequenced (4). In 1986,
Hood and co-workers (5) described an improvement [...] with this new
technology [...] human genes (9). [...] human EST sequences
necessitated the development of new computer algorithms to analyze
large amounts of sequence data, and in 1993 at The Institute for
Genomic Research (TIGR), an algorithm was developed that permitted
assembly and analysis [...].

[...] lambda genome sequence was determined by a shotgun restriction
digest method (11). When considering methods for sequencing the
smallpox virus genome [...], a whole-genome shotgun sequencing method
was discussed and subsequently rejected owing to the lack of
appropriate software [...]. However, in 1994, when a microbial
genome-sequencing project was contemplated [...], a whole-genome
shotgun sequencing approach was considered possible [...] algorithm.
In 1995, the [...] Haemophilus influenzae genome was completed by a
whole-genome shotgun sequencing method (15). The experience with
several subsequent genome-sequencing efforts [...]. A key feature of
the sequencing approach used for these megabase-size and larger
genomes was the use of paired-end sequences [...] sizes and [...].
Paired-end sequences are sequences [...] bp in length from both ends of
double-stranded DNA clones of prescribed lengths. The success of using
end sequences from long segments ([...] to 20 kbp) of DNA cloned into
bacteriophage lambda in [...]
simultaneously map and sequence the human genome by means of end
sequences from 150-kbp bacterial artificial chromosomes (BACs) (17,
18). The end sequences spanned by known distances provide long-range
continuity across the genome. A modification of the BAC end-sequencing
(BES) method was applied successfully to complete chromosome 2 from the
Arabidopsis thaliana genome (19).

In 1997, Weber and Myers (20) proposed whole-genome shotgun sequencing
of the human genome. Their proposal was not well received (21).
However, by early 1998, as less than 5% of the genome had been
sequenced, it was clear that the rate of progress in human genome
sequencing worldwide was very slow (22), and the prospects for
finishing the genome by the 2005 goal were uncertain.

In early 1998, PE Biosystems (now Applied Biosystems) developed an
automated, high-throughput capillary DNA sequencer, subsequently called
the ABI PRISM 3700 DNA Analyzer. Discussions between PE Biosystems and
TIGR scientists resulted in a plan to undertake the sequencing of the
human genome with the 3700 DNA Analyzer and the whole-genome shotgun
sequencing techniques developed at TIGR (23). Many of the principles of
operation of a genome-sequencing facility were established in the TIGR
facility (24). However, the facility envisioned for Celera would have a
capacity roughly 50 times that of TIGR, and thus new developments were
required for sample preparation and tracking and for whole-genome
assembly. Some argued that the required 150-fold scale-up from the H.
influenzae genome to the human genome with its complex repeat sequences
was not feasible (25). The Drosophila melanogaster genome was thus
chosen as a test case for whole-genome assembly on a large and complex
eukaryotic genome. In collaboration with Gerald Rubin and the Berkeley
Drosophila Genome Project, the nucleotide sequence of the 120-Mbp
euchromatic portion of the Drosophila genome was determined over a
1-year period (26-28). The Drosophila genome-sequencing effort resulted
in two key findings: (i) that the assembly algorithms could generate
chromosome assemblies with highly accurate order and orientation with
substantially less than 10-fold coverage, and (ii) that undertaking
multiple interim assemblies in place of one comprehensive final
assembly was not of value.

These findings, together with the dramatic changes in the public genome
effort subsequent to the formation of Celera (29), led to a modified
whole-genome shotgun sequencing approach to the human genome. We
initially proposed to do 10-fold sequence coverage of the genome over a
3-year period and to make interim assembled sequence data available
quarterly. The modifications included a plan to perform random shotgun
sequencing to ~5-fold coverage and to use the unordered and unoriented
BAC sequence fragments and subassemblies published in GenBank by the
publicly funded genome effort (30) to accelerate the project. We also
abandoned the quarterly announcements in the absence of interim
assemblies to report.

Although this strategy provided a reasonable result very early that was
consistent with a whole-genome shotgun assembly with eightfold
coverage, the human genome sequence is not as finished as the
Drosophila genome was with an effective 13-fold coverage. However, it
became clear that even with this reduced coverage strategy, Celera
could generate an accurately ordered and oriented scaffold sequence of
the human genome in less than 1 year. Human genome sequencing was
initiated 8 September 1999 and completed 17 June 2000. The first
assembly was completed 25 June 2000, and the assembly reported here was
completed 1 October 2000. Here we describe the whole-genome random
shotgun sequencing effort applied to the human genome. We developed two
different assembly approaches for assembling the ~3 billion bp that
make up the 23 pairs of chromosomes of the Homo sapiens genome. Any
GenBank-derived data were shredded to remove potential bias to the
final sequence from chimeric clones, foreign DNA contamination, or
misassembled contigs. Insofar as a correctly and accurately assembled
genome sequence with faithful order and orientation of contigs is
essential for an accurate analysis of the human genetic code, we have
devoted a considerable portion of this manuscript to the documentation
of the quality of our reconstruction of the genome. We also describe
our preliminary analysis of the human genetic code on the basis of
computational methods. Figure 1 (see fold-out chart associated with
this issue; files for each chromosome can be found in Web fig. 1 on
Science Online at www.sciencemag.org/cgi/content/full/291/5507/1304/DC1)
provides a graphical overview of the genome and the features encoded in
it. The detailed manual curation and interpretation of the genome are
just beginning.
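
The shredding of GenBank-derived data mentioned above can be pictured as cutting each public sequence into overlapping fixed-length pieces so that it enters assembly as if it were a set of reads. The sketch below is illustrative only: the text does not describe the actual shredding procedure, and the function name `shred`, the even tiling, and the default parameters are assumptions.

```python
def shred(sequence, fragment_length=550, fold_coverage=2.0):
    """Cut `sequence` into overlapping fragments of `fragment_length` bp.

    Fragments are tiled at a step of fragment_length / fold_coverage, and one
    final fragment is anchored at the end so that no bases are dropped.
    """
    if fragment_length <= 0 or fold_coverage <= 0:
        raise ValueError("fragment_length and fold_coverage must be positive")
    if len(sequence) <= fragment_length:
        return [sequence]
    step = max(1, round(fragment_length / fold_coverage))
    last_start = len(sequence) - fragment_length
    starts = list(range(0, last_start, step)) + [last_start]
    return [sequence[s:s + fragment_length] for s in starts]

# Hypothetical 4,000-bp input sequence.
bactig = "ACGT" * 1000
fragments = shred(bactig)
achieved = sum(len(f) for f in fragments) / len(bactig)
print(len(fragments), round(achieved, 2))
```

Because every fragment is cut from the assembled sequence rather than from a clone, errors such as chimeric joins are confined to the fragments that span them, which is the motivation the text gives for shredding.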
To aid the reader in locating specific analytical sections, we have
divided the paper into seven broad sections. A summary of the major
results appears at the beginning of each section.

1 Sources of DNA and Sequencing Methods

2 Genome Assembly Strategy and Characterization

3 Gene Prediction and Annotation

4 Genome Structure

5 Genome Evolution

6 A Genome-Wide Examination of Sequence Variations

7 An Overview of the Predicted Protein-Coding Genes in the Human Genome

8 Conclusions



1 Sources of DNA and Sequencing Methods

Summary. This section discusses the rationale and ethical rules
governing donor selection to ensure ethnic and gender diversity, along
with the methodologies for DNA extraction and library construction. The
plasmid library construction is the first critical step in shotgun
sequencing. If the DNA libraries are not uniform in size and
nonchimeric, and do not randomly represent the genome, then the
subsequent steps cannot accurately reconstruct the genome sequence. We
used automated high-throughput DNA sequencing and the computational
infrastructure to enable efficient tracking of enormous amounts of
sequence information (27.3 million sequence reads; 14.9 billion bp of
sequence). Sequencing and tracking from both ends of plasmid clones
from 2-, 10-, and 50-kbp libraries were essential to the computational
reconstruction of the genome. Our evidence indicates that the accurate
pairing rate of end sequences was greater than 98%.

Various policies of the United States and the World Medical
Association, specifically the Declaration of Helsinki, offer
recommendations for conducting experiments with human subjects. We
convened an Institutional Review Board (IRB) (31) that helped us
establish the protocol for obtaining and using human DNA and the
informed consent process used to enroll research volunteers for the
DNA-sequencing studies reported here. We adopted several steps and
procedures to protect the privacy rights and confidentiality of the
research subjects (donors). These included a two-stage consent process,
a secure random alphanumeric coding system for specimens and records,
circumscribed contact with the subjects by researchers, and options for
off-site contact of donors. In addition, Celera applied for and
received a Certificate of Confidentiality from the Department of Health
and Human Services. This Certificate authorized Celera to protect the
privacy of the individuals who volunteered to be donors as provided in
Section 301(d) of the Public Health Service Act 42 U.S.C. 241(d).

Celera and the IRB believed that the initial version of a completed
human genome should be a composite derived from multiple donors of
diverse ethnic backgrounds. Prospective donors were asked, on a
voluntary basis, to self-designate an ethnogeographic category (e.g.,
African-American, Chinese, Hispanic, Caucasian, etc.). We enrolled 21
donors (32).

Three basic items of information from each donor were recorded and
linked by confidential code to the donated sample: age, sex, and
self-designated ethnogeographic group. From females, ~130 ml of whole,
heparinized blood was collected. From males, ~130 ml of whole,
heparinized blood was collected, as well as five specimens of semen,
collected over a 6-week period. Permanent lymphoblastoid cell lines
were created by Epstein-Barr virus immortalization. DNA from five
subjects was selected for genomic DNA sequencing: two males and three
females (one African-American, one Asian-Chinese, one Hispanic-Mexican,
and two Caucasians) (see Web fig. 2 on Science Online at
www.sciencemag.org/cgi/content/291/5507/1304/DC1). The decision of
whose DNA to sequence was based on a complex mix of factors, including
the goal of achieving diversity as well as technical issues such as the
quality of the DNA libraries and availability of immortalized cell
lines.

1.1 Library construction and sequencing

Central to the whole-genome shotgun sequencing process is preparation
of high-quality plasmid libraries in a variety of insert sizes so that
pairs of sequence reads (mates) are obtained, one read from each end of
each plasmid insert. High-quality libraries have an equal
representation of all parts of the genome, a small number of clones
without inserts, and no contamination from such sources as the
mitochondrial genome and Escherichia coli genomic DNA. DNA from each
donor was used to construct plasmid libraries in one or more of three
size classes: 2 kbp, 10 kbp, and 50 kbp (Table 1) (33).

In designing the DNA-sequencing process, we focused on developing a
simple system that could be implemented in a robust and reproducible
manner and monitored effectively (Fig. 2) (34).

Current sequencing protocols are based on the dideoxy sequencing
method (35), which [...] per reaction. This limitation on read length
has made monumental gains in throughput a prerequisite for the analysis
of large eukaryotic genomes. We accomplished this at the Celera
facility, which occupies about 30,000 square feet of laboratory space
and produces sequence data continuously at a rate of 175,000 reads per
day. The DNA-sequencing facility is supported by a high-performance
computational [...].

The process for DNA sequencing was modular by design and automated.
Intermodule sample backlogs allowed [...] principal modules to operate
independently: (i) library transformation, plating, and colony
picking; (ii) DNA template preparation; (iii) dideoxy sequencing
reaction set-up and purification; and (iv) [...]. [...] each module
have been carefully matched and sample backlogs are continuously
managed, sequencing has proceeded without a single day's interruption
since the [...] time, currently estimated at about 15 min per day. The
capillary system also facilitates correct associations of sequencing
traces with samples through the elimination of manual sample loading
and lane-tracking errors associated with slab gels. About 65 production
staff were hired and trained, and were rotated on a regular basis
through the four production modules. A central laboratory information
management system (LIMS) tracked all sample plates by unique bar code
identifiers. The facility was supported by a quality control team that
performed raw material and in-process testing and a quality assurance
group with responsibilities including document control, validation, and
auditing of the facility. Critical to the success of the scale-up was
the validation of all software and instrumentation before
implementation, and production-scale testing of any process changes.

1.2 Trace processing
An automated trace-processing pipeline has been developed to process
each sequence file (37). After quality and vector trimming, the average
trimmed sequence length was 543 bp, and the sequencing accuracy was
exponentially distributed with a mean of 99.5% and with less than 1 in
1000 reads being less than 98% accurate (26). Each trimmed sequence was
screened for matches to contaminants including sequences of vector
alone, E. coli genomic DNA, and human mitochondrial DNA. [...] the
human mitochondrial genome.
1.3 Quality assessment and control
The importance of the base-pair level accuracy of the sequence data
increases as the size and repetitive nature of the genome to be
sequenced increases. Each sequence read must be placed uniquely in the
genome, and even a modest error rate can reduce the effectiveness of
assembly. In addition, maintaining the validity of mate-pair
information is absolutely critical for the algorithms described below.
Procedural controls were established for maintaining the validity of
sequence mate-pairs as sequencing reactions proceeded through the
process, including strict rules built into the LIMS. The accuracy of
sequence data produced by the Celera process was validated in the
course of the Drosophila genome project (26). By collecting data for
the entire human genome in a single facility, we were able to ensure
uniform quality standards and the cost advantages associated with
automation, an economy of scale, and process consistency.

Table 1. Celera-generated data input into assembly.

                                     Number of reads for different insert libraries
                          Individual        2 kbp       10 kbp      50 kbp        Total  Total number of base pairs
No. of sequencing reads   A                     0            0   2,767,357    2,767,357               1,502,674,851
                          B            11,736,757    7,467,755      66,930   19,271,442              10,464,393,006
                          C               853,819      881,290           0    1,735,109                 942,164,187
                          D               952,523    1,046,815           0    1,999,338               1,085,640,534
                          F                     0    1,498,607           0    1,498,607                 813,743,601
                          Total        13,543,099   10,894,467   2,834,287   27,271,853              14,808,616,179
Fold sequence coverage    A                     0        [...]       [...]         0.52
(2.9-Gb genome)           B                  2.20        [...]       [...]         3.61
                          C                  0.16        [...]       [...]         0.32
                          D                  0.18        [...]       [...]         0.37
                          F                     0        [...]       [...]         0.28
                          Total              2.54        [...]       [...]         5.11
Fold clone coverage       A                     0        [...]       [...]        [...]
                          B                  2.96        [...]       [...]        [...]
                          C                  0.22        [...]       [...]        [...]
                          D                  0.24        [...]       [...]        [...]
                          F                     0        [...]       [...]        [...]
                          Total              3.42        [...]       [...]        [...]
Insert size* (mean)       Average        1,951 bp    10,800 bp       [...]
Insert size* (SD)         Average           6.10%        8.10%       [...]
% Mates†                  Average           74.50        [...]       [...]

2 Genome Assembly Strategy and 
Characterization 

Fig. 2. Flow diagram for sequencing pipeline. [...] capability to
exchange samples and data with both internal and [...] defined quality
guidelines. Manufacturing [...] quality control measures, and
responsible parties are indicated and described further in the text.

Summary. We describe in this section the two approaches that we used to
assemble the genome. One method involves the computational combination
of all sequence reads with shredded data from GenBank to generate an
independent, nonbiased view of the genome. The second approach involves
clustering all of the fragments to a region or chromosome on the basis
of mapping information. The clustered data were then shredded and
subjected to computational assembly. Both approaches provided
essentially the same reconstruction of assembled DNA sequence with
proper order and orientation. The second method provided slightly
greater sequence coverage (fewer gaps) and was the principal sequence
used for the analysis phase. In addition, we document the completeness
and correctness of this assembly process and provide a comparison to
the public genome sequence, which was reconstructed largely by an
independent BAC-by-BAC approach. Our assemblies effectively covered the
euchromatic regions of the human chromosomes. More than 90% of the
genome was in scaffold assemblies of 100,000 bp or greater, and 25% of
the genome was in scaffolds of 10 million bp or larger.
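
Statistics of the form "X% of the genome is in scaffolds of N bp or larger" can be computed directly from a list of scaffold lengths. The sketch below is an illustration with hypothetical lengths, not Celera's data, and it assumes the denominator is the total assembled span; the text does not say whether the assembled span or the estimated genome size was used. The function name is also an assumption.

```python
def fraction_in_large_scaffolds(scaffold_lengths, min_length):
    """Fraction of the total assembled span held in scaffolds >= min_length bp."""
    total = sum(scaffold_lengths)
    if total == 0:
        raise ValueError("no assembled sequence")
    return sum(n for n in scaffold_lengths if n >= min_length) / total

# Hypothetical scaffold lengths in bp.
lengths = [12_000_000, 4_000_000, 500_000, 150_000, 60_000, 40_000]
print(round(fraction_in_large_scaffolds(lengths, 100_000), 3))
print(round(fraction_in_large_scaffolds(lengths, 10_000_000), 3))
```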

Shotgun sequence assembly is a classic example of an inverse problem:
given a set of reads randomly sampled from a target sequence,
reconstruct the order and the position of those reads in the target.
Genome assembly algorithms developed for Drosophila have now been
extended to assemble the ~25-fold larger human genome. Celera
assemblies consist of a set of contigs that are ordered and oriented
into scaffolds that are then mapped to chromosomal locations by using
known markers. The contigs consist of a collection of overlapping
sequence reads that provide a consensus reconstruction for a contiguous
interval of the genome. Mate pairs are a central component of the
assembly strategy. They are used to produce scaffolds in which the size
of gaps between consecutive contigs is known with reasonable precision.
This is accomplished by observing that a pair of reads, one of which is
in one contig and the other of which is in another, implies an
orientation and distance between the two contigs (Fig. 3). Finally, our
assemblies did not incorporate all reads into the final set of reported
scaffolds. This set of unincorporated reads is termed "chaff" and
typically consisted of reads from within highly repetitive regions,
data from other organisms introduced through various routes as found in
many genome projects, and data of poor quality or with untrimmed
vector.
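
The way a mate pair constrains the distance between two contigs can be made concrete with a little arithmetic: the library's mean insert size, minus the part of the insert lying in each contig, leaves the gap. The following is a minimal sketch, not the Celera assembler's method. It assumes the forward read lies in contig A, its reverse-strand mate lies in the following contig B, and insert sizes are independent; the function names and all numbers are hypothetical.

```python
import math

def gap_from_mate_pair(insert_mean, contig_a_length, read_a_start, read_b_end):
    """Estimate the gap (bp) between contig A and the contig B that follows it.

    read_a_start: offset in contig A where the forward read begins.
    read_b_end:   offset in contig B where the reverse-strand mate ends,
                  measured from the start of contig B.
    """
    covered_in_a = contig_a_length - read_a_start  # insert bases lying in A
    covered_in_b = read_b_end                      # insert bases lying in B
    return insert_mean - covered_in_a - covered_in_b

def combine_links(gap_estimates, insert_sd):
    """Average several estimates; the standard error shrinks as 1/sqrt(n)."""
    n = len(gap_estimates)
    if n == 0:
        raise ValueError("need at least one mate pair")
    return sum(gap_estimates) / n, insert_sd / math.sqrt(n)

# Hypothetical library: mean insert 10,800 bp, standard deviation 875 bp.
links = [
    gap_from_mate_pair(10_800, 50_000, 46_000, 5_000),
    gap_from_mate_pair(10_800, 50_000, 47_500, 6_300),
]
mean_gap, std_error = combine_links(links, 875)
print(links, mean_gap, round(std_error))
```

A negative estimate would suggest that the two contigs overlap rather than being separated by a gap, and several consistent links between the same pair of contigs tighten the estimate.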
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2.1 Assembly data sets 
We used two independent sets of data for our
assemblies. The first was a random shotgun
data set of 27.27 million reads of average length
543 bp produced at Celera. This consisted
largely of mate-pair reads from 16 libraries
constructed from DNA samples taken from five
different donors. Libraries with insert sizes of 2,
10, and 50 kbp were used. By looking at how
mate pairs from a library were positioned in
known sequenced stretches of the genome, we
were able to characterize the range of insert
sizes in each library and determine a mean and
standard deviation. Table 1 details the number
of reads, sequencing coverage, and clone
coverage achieved by the data set. The clone
coverage is the coverage of the genome in cloned
DNA, considering the entire insert of each
clone that has sequence from both ends. The
clone coverage provides a measure of the
amount of physical DNA coverage of the
genome. Assuming a genome size of 2.9 Gbp, the
Celera trimmed sequences gave a 5.1X
coverage of the genome, and clone coverage was
3.42X, 16.40X, and 18.84X for the 2-, 10-, and
50-kbp libraries, respectively, for a total of
38.7X clone coverage.
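
The coverage figures quoted above follow from simple arithmetic. The
sketch below reproduces them from the numbers in the text; the
function name fold_coverage is hypothetical, and the average read
length of 543 bp is used as a stand-in for the trimmed length.

```python
def fold_coverage(n_reads: float, mean_len_bp: float, genome_bp: float) -> float:
    """Fold coverage = total sequenced bases / genome size."""
    return n_reads * mean_len_bp / genome_bp


# Sequence coverage of the Celera shotgun data set.
sequence_coverage = fold_coverage(27.27e6, 543, 2.9e9)

# Clone coverage per library, as quoted in the text.
clone_coverage = {"2 kbp": 3.42, "10 kbp": 16.40, "50 kbp": 18.84}
total_clone_coverage = sum(clone_coverage.values())

print(f"sequence coverage: {sequence_coverage:.1f}X")
print(f"clone coverage:    {total_clone_coverage:.1f}X")
```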

The second data set was from the publicly
funded Human Genome Project (PFP) and is
primarily derived from BAC clones (30). The
BAC data input to the assemblies came from a
download of GenBank on 1 September 2000
(Table 2) totaling 4443.3 Mbp of sequence.
The data for each BAC is deposited at one of
four levels of completion. Phase 0 data are a set
of generally unassembled sequencing reads
from a very light shotgun of the BAC, typically
less than 1X. Phase 1 data are unordered
assemblies of contigs, which we call BAC contigs
or bactigs. Phase 2 data are ordered assemblies
of bactigs. Phase 3 data are complete BAC
sequences. In the past 2 years the PFP has
focused on a product of lower quality and
completeness, but on a faster time-course, by
concentrating on the production of Phase 1 data
from a 3X to 4X light-shotgun of each BAC
clone.

We screened the bactig sequences for
contaminants by using the BLAST algorithm
against three data sets: (i) vector sequences
in Univec core (38), filtered for a 25-bp
match at 98% sequence identity at the ends
of the sequence and a 30-bp match internal
to the sequence; (ii) the nonhuman portion
of the High Throughput Genomic (HTG)
Sequences division of GenBank (39),
filtered at 200 bp at 98%; and (iii) the
nonredundant nucleotide sequences from
GenBank without primate and human virus
entries, filtered at 200 bp at 98%. Whenever
25 bp or more of vector was found within
50 bp of the end of a contig, the tip up to
the matching vector was excised. Under
these criteria we removed 2.6 Mbp of
possible contaminant and vector from the
Phase 3 data, 61.0 Mbp from the Phase 1
and 2 data, and 16.1 Mbp from the Phase 0
data (Table 2). This left us with a total of
4363.7 Mbp of PFP sequence data 20%
finished, 75% rough-draft (Phase 1 and 2),
and 5% single sequencing reads (Phase 0).
An additional 104,018 BAC end-sequence
mate pairs were also downloaded and
included in the data sets for both assembly
processes (18).
2.2 Assembly strategies
Two different approaches to assembly were
pursued. The first was a whole-genome
assembly process that used Celera data and the
PFP data in the form of additional synthetic
shotgun data, and the second was a
compartmentalized assembly process that first
partitioned the Celera and PFP data into sets
localized to large chromosomal segments and
then performed ab initio shotgun assembly on
each set. Figure 4 gives a schematic of the
overall process flow.

For the whole-genome assembly, the PFP
data was first disassembled or "shredded" into a
synthetic shotgun data set of 550-bp reads that
form a perfect 2X covering of the bactigs. This
resulted in 16.05 million "faux" reads that were
sufficient to cover the genome 2.96X because
of redundancy in the BAC data set, without
incorporating the biases inherent in the PFP
assembly process. The combined data set of
43.32 million reads (8X), and all associated
mate-pair information, were then subjected to
our whole-genome assembly algorithm to
produce a reconstruction of the genome. Neither
the location of a BAC in the genome nor its
assembly of bactigs was used in this process.
Bactigs were shredded into reads because we
found strong evidence that 2.13% of them were
misassembled (40). Furthermore, BAC location
information was ignored because some BACs
were not correctly placed on the PFP physical
map and because we found strong evidence that
at least 2.2% of the BACs contained sequence
data that were not part of the given BAC (41),
possibly as a result of sample-tracking errors
(see below). In short, we performed a true, ab
initio whole-genome assembly in which we
took the expedient of deriving additional
sequence coverage, but not mate pairs, assembled
bactigs, or genome locality, from some
externally generated data.

Table 2. GenBank data input into assembly.

                                           Completion phase sequence
Center / statistic                    Phase 0   Phase 1 and 2       Phase 3

Whitehead Institute/MIT Center for Genome Research, USA
  Number of accession records           2,825           6,533           363
  Number of contigs                   243,786         138,023           363
  Total base pairs                194,490,158   1,083,848,245    48,829,358
  Total vector masked (bp)          1,553,597         875,618         2,202
  Total contaminant masked (bp)    13,654,482       4,417,055        98,028
  Average contig length (bp)              798           7,853       134,516

Washington University, USA
  Number of accession records              19           3,232         1,300
  Number of contigs                     2,127          61,812         1,300
  Total base pairs                  1,195,732     561,171,788   164,214,395
  Total vector masked (bp)             21,604         270,942         8,287
  Total contaminant masked (bp)        22,469       1,476,141       469,437
  Average contig length (bp)              562           9,079       126,319

Baylor College of Medicine, USA
  Number of accession records               0           1,626           363
  Number of contigs                         0          44,861           363
  Total base pairs                          0     265,547,066    49,017,104
  Total vector masked (bp)                  0         218,769         4,960
  Total contaminant masked (bp)             0       1,784,700       485,137
  Average contig length (bp)                0           5,919       135,033

Production Sequencing Facility, DOE Joint Genome Institute, USA
  Number of accession records             135           2,043           754
  Number of contigs                     7,052          34,938           754
  Total base pairs                  8,680,214     294,249,631    60,975,328
  Total vector masked (bp)             22,644         162,651         7,274
  Total contaminant masked (bp)       665,818       4,642,372       118,387
  Average contig length (bp)            1,231           8,422        80,867

The Institute of Physical and Chemical Research (RIKEN), Japan
  Number of accession records               0           1,149           300
  Number of contigs                         0          25,772           300
  Total base pairs                          0     182,812,275    20,093,926
  Total vector masked (bp)                  0         203,792         2,371
  Total contaminant masked (bp)             0         308,426        27,781
  Average contig length (bp)                0           7,093        66,978

Sanger Centre, UK
  Number of accession records               0           4,538         2,599
  Number of contigs                         0          74,324         2,599
  Total base pairs                          0     689,059,692   246,118,000
  Total vector masked (bp)                  0         427,326        25,054
  Total contaminant masked (bp)             0       2,066,305       374,561
  Average contig length (bp)                0           9,271        94,697

Others*
  Number of accession records              42           1,894         3,458
  Number of contigs                     5,978          29,898         3,458
  Total base pairs                  5,564,879     283,358,877   246,474,157
  Total vector masked (bp)             57,448         279,477        32,136
  Total contaminant masked (bp)       575,366       1,616,665     1,791,849
  Average contig length (bp)              931           9,478        71,277

All centers combined†
  Number of accession records           3,021          21,015         9,137
  Number of contigs                   258,943         409,628         9,137
  Total base pairs                209,930,983   3,360,047,574   835,722,268
  Total vector masked (bp)          1,655,293       2,438,575        82,284
  Total contaminant masked (bp)    14,918,135      16,311,664     3,365,230
  Average contig length (bp)              811           8,203        91,466
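
The shredding of bactigs into faux reads can be sketched as follows.
This is a minimal illustration, assuming fixed 550-bp reads that start
every 275 bp so that interior bases are covered twice, with one extra
read anchored at the end of the bactig; the function name shred and
the handling of the ends are assumptions, not details given in the
text.

```python
def shred(bactig: str, read_len: int = 550, step: int = 275):
    """Tile a bactig with overlapping faux reads (start offset, sequence)."""
    n = len(bactig)
    if n <= read_len:
        # A bactig shorter than one read is kept as a single faux read.
        return [(0, bactig)]
    starts = list(range(0, n - read_len + 1, step))
    if starts[-1] != n - read_len:
        # Anchor a final read at the end so the last bases are covered.
        starts.append(n - read_len)
    return [(s, bactig[s:s + read_len]) for s in starts]


bactig = "ACGT" * 500  # toy 2000-bp bactig
faux_reads = shred(bactig)

# Per-base depth of the tiling.
depth = [0] * len(bactig)
for start, seq in faux_reads:
    for i in range(start, start + len(seq)):
        depth[i] += 1
```

Because faux reads carry no memory of how the bactig was assembled,
the assembler is free to place them independently, which is the stated
reason for shredding.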

In the compartmentalized shotgun assembly
(CSA), Celera and PFP data were partitioned
into the largest possible chromosomal segments
or "components" that could be determined with
confidence, and then shotgun assembly was
applied to each partitioned subset wherein the
bactig data were again shredded into faux reads
to ensure an independent ab initio assembly of
the component. By subsetting the data in this
way, the overall computational effort was
reduced and the effect of interchromosomal
duplications was ameliorated. This also resulted
in a reconstruction of the genome that was
relatively independent of the whole-genome
assembly results so that the two assemblies could
be compared for consistency. The quality of the
partitioning into components was crucial so that
different genome regions were not mixed
together. We constructed components from (i) the
longest scaffolds of the sequence from each
BAC and (ii) assembled scaffolds of data unique
to Celera's data set. The BAC assemblies were
obtained by a combining assembler that used
[...] stretch, the more accurately one can tile
these scaffolds into contiguous components on
the basis of sequence overlap and mate-pair
information. We further visually inspected and
[...] to the partitioned, relevant Celera data and
the shredded, faux reads of the partitioned,
relevant bactig data.
2.3 Whole-genome assembly
The algorithms used for whole-genome
assembly (WGA) of the human genome were
enhancements to those used to produce the
sequence of the Drosophila genome reported
in detail in (28).

The WGA assembler consists of a pipeline
composed of five principal stages: Screener,
Overlapper, Unitigger, Scaffolder, and Repeat
Resolver, respectively. The Screener finds
and marks all microsatellite repeats with less
than a 6-bp element, and screens out all
known interspersed repeat elements, including
Alu, Line, and ribosomal DNA. Marked
regions get searched for overlaps, whereas
screened regions do not get searched, but can
be part of an overlap that involves unscreened
matching segments.
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The Overlapper compares every read 
against every other read in search of complete
end-to-end overlaps of at least 40 bp and with
no more than 6% differences in the match.
Because all data are scrupulously
vector-trimmed, the Overlapper can insist on
complete overlap matches. Computing the set of
all overlaps took roughly 10,000 CPU hours
with a suite of four-processor Alpha SMPs
with 4 gigabytes of RAM. This took 4 to 5
days in elapsed time with 40 such machines
operating in parallel.
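
The acceptance rule for an overlap (at least 40 bp, no more than 6%
differences, extending to the ends of the reads) can be illustrated
with a deliberately simplified check. The sketch below assumes
substitutions only (no insertions or deletions), a single orientation,
and a suffix-to-prefix overlap; the real Overlapper is not described at
this level of detail, and the function name end_overlap is
hypothetical.

```python
import random


def end_overlap(a: str, b: str, min_len: int = 40, max_diff: float = 0.06):
    """Longest suffix of `a` that matches a prefix of `b` within max_diff.

    Returns (overlap_length, mismatches), or None if no overlap qualifies.
    """
    for olen in range(min(len(a), len(b)), min_len - 1, -1):
        suffix, prefix = a[-olen:], b[:olen]
        mismatches = sum(x != y for x, y in zip(suffix, prefix))
        if mismatches <= max_diff * olen:
            return olen, mismatches
    return None


def mutate(read: str, positions) -> str:
    """Substitute a different base at each given position."""
    bases = list(read)
    for i in positions:
        bases[i] = "A" if bases[i] != "A" else "C"
    return "".join(bases)


random.seed(1)
genome = "".join(random.choice("ACGT") for _ in range(200))
read_a = genome[0:120]
read_b = genome[70:190]                      # true 50-bp overlap with read_a

exact = end_overlap(read_a, read_b)
noisy = end_overlap(read_a, mutate(read_b, [5, 30]))          # 2/50 = 4%
rejected = end_overlap(read_a, mutate(read_b, [5, 12, 30, 44]))  # 4/50 = 8%
```

An all-against-all comparison with this rule is quadratic in the
number of reads, which is consistent with the compute cost quoted
above and with the later decision to parallelize the Overlapper.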
Every overlap computed above is
statistically a 1-in-10^17 event and thus not a
coincidental event. What makes assembly
combinatorially difficult is that while many
overlaps are actually sampled from overlapping
regions of the genome, and thus imply that
the sequence reads should be assembled
together, even more overlaps are actually from
two distinct copies of a low-copy repeated
element not screened above, thus constituting
an error if put together. We call the former
"true overlaps" and the latter "repeat-induced
overlaps." The assembler must avoid choosing
repeat-induced overlaps, especially early
in the process.

We achieve this objective in the Unitigger.
We first find all assemblies of reads that
appear to be uncontested with respect to all
other reads. We call the contigs formed from
these subassemblies unitigs (for uniquely
assembled contigs). Formally, these unitigs are
the uncontested interval subgraphs of the
graph of all overlaps (42). Unfortunately,
although empirically many of these assemblies
are correct (and thus involve only true
overlaps), some are in fact collections of reads
from several copies of a repetitive element
that have been overcollapsed into a single
subassembly. However, the overcollapsed
unitigs are easily identified because their
average coverage depth is too high to be
consistent with the overall level of sequence
coverage. We developed a simple statistical
discriminator that gives the logarithm of the
odds ratio that a unitig is composed of unique
DNA or of a repeat consisting of two or more
copies. The discriminator, set to a sufficiently
stringent threshold, identifies a subset of the
unitigs that we are certain are correct. In
addition, a second, less stringent threshold
identifies a subset of remaining unitigs very
likely to be correctly assembled, of which we
select those that will consistently scaffold
(see below), and thus are again almost certain
to be correct. We call the union of these two
sets U-unitigs. Empirically, we found from a
6X simulated shotgun of human chromosome
22 that we get U-unitigs covering 98% of the
stretches of unique DNA that are >2 kbp
long. We are further able to identify the
boundary of the start of a repetitive element
at the ends of a U-unitig and leverage this so
that U-unitigs span more than 93% of all
singly interspersed Alu elements and other
100- to 400-bp repetitive segments.
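
A log-odds discriminator of the kind described can be derived under a
simple model. The sketch below assumes read start positions arrive as
a Poisson process at a uniform rate along the genome, and compares the
hypothesis "unique" (one copy) against "collapsed two-copy repeat"
(twice the arrival rate). The algebra reduces to rate * length -
k * ln 2. This is one plausible form, offered as an assumption; the
text does not give the authors' exact formula or thresholds, and the
function name is hypothetical.

```python
import math


def unique_vs_repeat_log_odds(unitig_len_bp: float, n_reads: int,
                              total_reads: float, genome_bp: float) -> float:
    """ln P(k reads | unique) - ln P(k reads | two collapsed copies).

    With Poisson arrivals at rate r = total_reads / genome_bp:
      unique:   mean r * L       two-copy: mean 2 * r * L
      log-odds: r * L - k * ln(2)
    """
    rate = total_reads / genome_bp
    return rate * unitig_len_bp - n_reads * math.log(2)


# A 10-kbp unitig in a 27.27-million-read shotgun of a 2.9-Gbp genome
# is expected to contain about 94 reads if it is unique.
as_expected = unique_vs_repeat_log_odds(10_000, 94, 27.27e6, 2.9e9)
double_depth = unique_vs_repeat_log_odds(10_000, 188, 27.27e6, 2.9e9)
```

A positive score favours unique DNA and a negative score favours an
overcollapsed repeat; a stringent cutoff and a looser second cutoff on
such a score would give the two subsets the text combines into
U-unitigs.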

The result of running the Unitigger was
thus a set of correctly assembled subcontigs
covering an estimated 73.6% of the human
genome. The Scaffolder then proceeded to
use mate-pair information to link these
together into scaffolds. When there are two or
more mate pairs that imply that a given pair
of U-unitigs are at a certain distance and
orientation with respect to each other, the
probability of this being wrong is again
roughly 1 in 10^10, assuming that mate pairs
are false less than 2% of the time. Thus, one
can with high confidence link together all
U-unitigs that are linked by at least two 2- or
10-kbp mate pairs producing
intermediate-sized scaffolds that are then
recursively linked together by confirming
50-kbp mate pairs and BAC end sequences. This
process yielded scaffolds that are on the order
of megabase pairs in size with gaps between
their contigs that generally correspond to
repetitive elements and occasionally to small
sequencing gaps. These scaffolds reconstruct
the majority of the unique sequence within a
genome.
For the Drosophila assembly, we engaged
in a three-stage repeat resolution strategy
where each stage was progressively more
aggressive and thus more likely to make a
mistake. For the human assembly, we
continued to use the first "Rocks" substage where
all unitigs with a good, but not definitive,
discriminator score are placed in a scaffold
gap. This was done with the condition that
two or more mate pairs with one of their
reads already in the scaffold unambiguously
place the unitig in the given gap. We estimate
the probability of inserting a unitig into an
incorrect gap with this strategy to be less than
10^-7 based on a probabilistic analysis.

We revised the ensuing "Stones" substage
of the human assembly, making it more like
the mechanism suggested in our earlier work
(43). For each gap, every read R that is placed
in the gap by virtue of its mated pair M being
in a contig of the scaffold and implying R's
placement is collected. Celera's mate-pairing
information is correct more than 99% of the
time. Thus, almost every, but not all, of the
reads in the set belong in the gap, and when
a read does not belong it rarely agrees with
the remainder of the reads. Therefore, we
simply assemble this set of reads within the
gap, eliminating any reads that conflict with
the assembly. This operation proved
more reliable than the one it replaced for the
Drosophila assembly; in the assembly of a
simulated shotgun data set of human
chromosome 22, all stones were placed correctly.

[Fig. 4. Schematic of the overall process flow
for the WGA and CSA assemblies; each oval
denotes a computation.]

The final method of resolving gaps is to
fill them with assembled BAC data that cover
the gap. We call this external gap "walking."
We did not include the very aggressive
"Pebbles" substage described in our Drosophila
work, which made enough mistakes so as to
produce repeat reconstructions for long
interspersed elements whose quality was only
99.62% correct. We decided that for the
human genome it was philosophically better not
to introduce a step that was certain to produce
less than 99.99% accuracy. The cost was a
somewhat larger number of gaps of
somewhat larger size.

At the final stage of the assembly process,
and also at several intermediate points, a
consensus sequence of every contig is
produced. Our algorithm is driven by the
principle of maximum parsimony, with
quality-value-weighted measures for evaluating each
base. The net effect is a Bayesian estimate of
the correct base to report at each position.
Consensus generation uses Celera data
whenever it is present. In the event that no Celera
data cover a given region, the BAC data
sequence is used.
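
The consensus rule can be illustrated with a quality-weighted vote.
This is a simplified stand-in, not the algorithm used in the assembly:
it assumes each aligned read contributes a (base, quality) pair with a
Phred-style quality value, sums the qualities per base, and reports
the base with the highest total. The fallback to BAC data when no
Celera read covers a position follows the text; the function names are
hypothetical.

```python
from collections import defaultdict


def consensus_base(column):
    """Pick the base with the largest summed quality in one alignment column.

    `column` is a list of (base, quality) pairs from the reads covering
    the position.
    """
    score = defaultdict(float)
    for base, quality in column:
        score[base] += quality
    return max(score, key=score.get)


def call_position(celera_column, bac_base):
    """Use Celera reads when present, otherwise fall back to the BAC base."""
    if celera_column:
        return consensus_base(celera_column)
    return bac_base


two_good_reads = call_position([("A", 30), ("A", 20), ("G", 40)], "T")
one_strong_read = call_position([("A", 10), ("A", 10), ("G", 40)], "T")
bac_only = call_position([], "T")
```

Summing Phred-style qualities is a common approximation to comparing
log-likelihoods under independent errors, which is why a weighted vote
of this kind can be read as a rough Bayesian estimate.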

A key element of achieving a WGA of the
human genome was to parallelize the
Overlapper and the central consensus
sequence-constructing subroutines. In addition,
memory was a real issue: a straightforward
application of the software we had built for
Drosophila would have required a computer with
a 600-gigabyte RAM. By making the Overlapper
and Unitigger incremental, we were able to
achieve the same computation with a maximum
of instantaneous usage of 28 gigabytes of RAM.
Moreover, the incremental nature of the first
three stages allowed us to continually update
the state of this part of the computation as data
were delivered and then perform a 7-day run to
complete Scaffolding and Repeat Resolution
whenever desired. For our assembly operations,
the total compute infrastructure consists of 10
four-processor SMPs with 4 gigabytes of memory
per cluster (Compaq's ES40, Regatta) and a
16-processor NUMA machine with 64 gigabytes
of memory (Compaq's GS160, Wildfire). The
total compute for a run of the assembler was
roughly 20,000 CPU hours.

The assembly of Celera's data, together
with the shredded bactig data, produced a set of
scaffolds totaling 2.848 Gbp in span and
consisting of 2.586 Gbp of sequence. The chaff, or
set of reads not incorporated in the assembly,
numbered 11.27 million (26%), which is
consistent with our experience for Drosophila.
More than 84% of the genome was covered by
scaffolds >100 kbp long, and these averaged
91% sequence and 9% gaps with a total of
2.297 Gbp of sequence. There were a total of
93,857 gaps among the 1637 scaffolds >100
kbp. The average scaffold size was 1.5 Mbp,
the average contig size was 24.06 kbp, and the
average gap size was 2.43 kbp, where the
distribution of each was essentially exponential.
More than 50% of all gaps were less than 500
bp long, >62% of all gaps were less than 1 kbp
long, and no gap was >100 kbp long.
Similarly, more than 65% of the sequence is in
contigs >30 kbp, more than 31% is in contigs
>100 kbp, and the largest contig was 1.22 Mbp
long. Table 3 gives detailed summary statistics
for the structure of this assembly with a direct
comparison to the compartmentalized shotgun
assembly.
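
The averages quoted above can be recomputed from the underlying
counts. The sketch below uses the Table 3 values for WGA scaffolds
>100 kbp as read from the scanned table (2,525,334,447 bp in
scaffolds, 2,297,678,935 bp in contigs, 1,637 scaffolds, 95,494
contigs); treat those inputs as transcribed values rather than
verified ones. The function name scaffold_stats is hypothetical.

```python
def scaffold_stats(bp_in_scaffolds: int, bp_in_contigs: int,
                   n_scaffolds: int, n_contigs: int) -> dict:
    """Derive gap counts and averages from scaffold and contig totals."""
    # Each scaffold of c contigs has c - 1 gaps.
    n_gaps = n_contigs - n_scaffolds
    return {
        "gaps": n_gaps,
        "avg_scaffold_bp": bp_in_scaffolds / n_scaffolds,
        "avg_contig_bp": bp_in_contigs / n_contigs,
        "avg_gap_bp": (bp_in_scaffolds - bp_in_contigs) / n_gaps,
        "fraction_sequence": bp_in_contigs / bp_in_scaffolds,
    }


wga_over_100kbp = scaffold_stats(2_525_334_447, 2_297_678_935, 1_637, 95_494)
for name, value in wga_over_100kbp.items():
    print(f"{name}: {value:,.2f}")
```

The derived values agree with the text: 93,857 gaps, an average
scaffold of about 1.5 Mbp, an average contig of about 24.06 kbp, an
average gap of about 2.43 kbp, and 91% sequence.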

2.4 Compartmentalized shotgun 
assembly 

In addition to the WGA approach, we
pursued a localized assembly approach that was
intended to subdivide the genome into
segments, each of which could be shotgun
assembled individually. We expected that this
would help in resolution of large
interchromosomal duplications and improve the
statistics for calculating U-unitigs. The
compartmentalized assembly process involved
clustering Celera reads and bactigs into large,
multiple megabase regions of the genome,
and then running the WGA assembler on the
Celera data and shredded, faux reads
obtained from the bactig data.

The first phase of the CSA strategy was to
separate Celera reads into those that matched
the BAC contigs for a particular PFP BAC
entry, and those that did not match any public
data. Such matches must be guaranteed to
properly place a Celera read, so all reads were
first masked against a library of common
repetitive elements, and only matches of at
least 40 bp to unmasked portions of the read
constituted a hit. Of Celera's 27.27 million
reads, 20.76 million matched a bactig and
another 0.62 million reads, which did not
have any matches, were nonetheless
identified as belonging in the region of the bactig's
BAC because their mate matched the bactig.
Of the remaining reads, 2.92 million were
completely screened out and so could not be
matched, but the other 2.97 million reads had
unmasked sequence totaling 1.189 Gbp that
were not found in the GenBank data set.
Because the Celera data are 5.11X redundant,
we estimate that 240 Mbp of unique Celera
sequence is not in the GenBank data set.

Table 3. Scaffold statistics for the compartmentalized shotgun and
whole-genome assemblies. Cells marked [illegible] could not be read
in the scanned source.

                                                          Scaffold size
                                               All       >30 kbp      >100 kbp      >500 kbp     >1000 kbp

Compartmentalized shotgun assembly
No. of bp in scaffolds
  (including intrascaffold gaps)     2,905,568,203   [illegible]   [illegible] 2,489,357,260 2,248,689,125
No. of bp in contigs                 2,653,979,733 2,524,251,302 2,491,538,372 2,320,648,201 2,106,521,902
No. of scaffolds                            53,591         2,845         1,935         1,060           721
No. of contigs                             170,033       112,207       107,199        93,138        82,005
No. of gaps                                116,442       109,362       105,264        92,078        81,284
No. of gaps ≤1 kbp                          72,091        69,175        67,289        59,315        53,354
Average scaffold size (bp)                  54,217       966,219     1,395,602     2,348,450     3,118,848
Average contig size (bp)                    15,609        22,496        23,242        24,916        25,688
Average intrascaffold gap size (bp)          2,161         2,054         1,985         1,832         1,749
Largest contig (bp)                      1,988,321     1,988,321     1,988,321     1,988,321     1,988,321
% of total contigs                             100            95            94            87            79

Whole-genome assembly
No. of bp in scaffolds
  (including intrascaffold gaps)     2,847,890,390 2,574,792,618 2,525,334,447 2,328,535,466 2,140,943,032
No. of bp in contigs                 2,586,634,108 2,334,343,339 2,297,678,935 2,143,002,184 1,983,305,432
No. of scaffolds                           118,968         2,507         1,637           818           554
No. of contigs                             221,036        99,189        95,494        84,641        76,285
No. of gaps                                102,068        96,682        93,857        83,823        75,731
No. of gaps ≤1 kbp                          62,356        60,343        59,156        54,079        49,592
Average scaffold size (bp)                  23,938     1,027,041     1,542,660     2,846,620     3,864,518
Average contig size (bp)                    11,702        23,534        24,061        25,319        25,998
Average intrascaffold gap size (bp)          2,560         2,487         2,426         2,213         2,082
Largest contig (bp)                      1,224,073     1,224,073     1,224,073     1,224,073     1,224,073
% of total contigs                             100            90            89            83            77

In the next step of the CSA process, a
combining assembler took the relevant 5X
Celera reads and bactigs for a BAC entry, and
produced an assembly of the combined data
for that locale. These high-quality sequence
reconstructions were a transient result whose
utility was simply to provide more reliable
information for the purposes of their tiling
into sets of overlapping and adjacent scaffold
sequences in the next step. In outline, the
combining assembler first examines the set of
matching Celera reads to determine if there
are excessive pileups indicative of
unscreened repetitive elements. Wherever these
occur, reads in the repeat region whose mates
have not been mapped to consistent positions
are removed. Then all sets of mate pairs that
consistently imply the same relative position
of two bactigs are bundled into a link and
weighted according to the number of mates in
the bundle. A "greedy" strategy then attempts
to order the bactigs by selecting bundles of
mate-pairs in order of their weight. A selected
mate-pair bundle can tie together two
formative scaffolds. It is incorporated to form a
single scaffold only if it is consistent with the
majority of links between contigs of the
scaffold. Once scaffolding is complete, gaps are
filled by the "Stones" strategy described
above for the WGA assembler.
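
The bundling and greedy selection described above can be sketched as
follows. This is a minimal illustration under strong assumptions: each
mate pair is reduced to an unordered pair of bactig names, bundles are
weighted by mate count, and a bundle is accepted if it joins two
different formative scaffolds. The consistency test against the
majority of existing links, and all position and orientation
bookkeeping, are omitted. The function names are hypothetical.

```python
from collections import Counter


def bundle_links(mate_links):
    """Group mate pairs by the pair of bactigs they connect; weight = count."""
    return Counter(tuple(sorted(pair)) for pair in mate_links)


def greedy_scaffold(bactigs, bundles):
    """Join bactigs into scaffolds, taking the heaviest bundles first."""
    parent = {b: b for b in bactigs}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    accepted = []
    ordered = sorted(bundles.items(), key=lambda kv: (-kv[1], kv[0]))
    for (a, b), weight in ordered:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:            # bundle ties two formative scaffolds
            parent[root_b] = root_a
            accepted.append((a, b, weight))

    groups = {}
    for b in bactigs:
        groups.setdefault(find(b), []).append(b)
    return accepted, sorted(groups.values())


links = [("b1", "b2")] * 3 + [("b2", "b3")] * 2 + [("b1", "b3")]
bundles = bundle_links(links)
accepted, scaffolds = greedy_scaffold(["b1", "b2", "b3", "b4"], bundles)
```

Taking heavy bundles first means the best-supported joins fix the
layout before weaker, possibly spurious links are considered.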

The GenBank data for the Phase 1 and 2
BACs consisted of an average of 19.8 bactigs
per BAC of average size 8099 bp.
Application of the combining assembler resulted in
individual Celera BAC assemblies being put
together into an average of 1.83 scaffolds
(median of 1 scaffold) consisting of an
average of 8.57 contigs of average size 18,973 bp.
In addition to defining order and orientation
of the sequence fragments, there were 57%
fewer gaps in the combined result. For Phase
0 data, the average GenBank entry consisted
of 91.52 reads of average length 784 bp.
Application of the combining assembler
resulted in an average of 54.8 scaffolds
consisting of an average of 58.1 contigs of average
size 873 bp. Basically, some small amount of
assembly took place, but not enough Celera
data were matched to truly assemble the 0.5X
to 1X data set represented by the typical
Phase 0 BACs. The combining assembler was
also applied to the Phase 3 BACs for
SNP identification, confirmation of
assembly, and localization of the Celera reads. The
Phase 0 data suggest that a combined
whole-genome shotgun data set and 1X
light-shotgun of BACs will not yield good assembly of
BAC regions; at least 3X light-shotgun of
each BAC is needed.

The 5.89 million Celera fragments not
matching the GenBank data were assembled
with our whole-genome assembler. The
assembly resulted in a set of scaffolds totaling
442 Mbp in span and consisting of 326 Mbp
of sequence. More than 20% of the scaffolds
were >5 kbp long, and these averaged 73%
sequence and 27% gaps with a total of 302
Mbp of sequence. All scaffolds >5 kbp were
forwarded along with all scaffolds produced
by the combining assembler to the
subsequent tiling phase.

At this stage, we typically had one or two
scaffolds for every BAC region constituting
at least 95% of the relevant sequence, and a
collection of disjoint Celera-unique scaffolds.
The next step in developing the genome
components was to determine the order and
overlap tiling of these BAC and Celera-unique
scaffolds across the genome. For this, we
used Celera's 50-kbp mate-pairs information,
and BAC-end pairs (18) and sequence tagged
site (STS) markers (44) to provide
long-range guidance and chromosome separation.
Given the relatively manageable number of
scaffolds, we chose not to produce this tiling
in a fully automated manner, but to compute
an initial tiling with a good heuristic and then
use human curators to resolve discrepancies
or missed join opportunities. To this end, we
developed a graphical user interface that
displayed the graph of tiling overlaps and the
evidence for each. A human [...]
of 2.922 Gbp.

In order to generate the final CSA, [...]
were shredded into [...]. By using faux reads
rather than bactigs, the assembly algorithm could
correct errors in the assembly of bactigs and
remove chimeric content in a PFP data entry.
Chimeric or contaminating sequence (from
another part of the genome) would not be
incorporated into the reassembly of the
component because it did not belong there. In
effect, the previous steps in the CSA process
served only to bring together Celera
fragments and PFP data relevant to a large
contiguous segment of the genome, wherein we
applied the assembler used for WGA to
produce an ab initio assembly of the region.
WGA assembly of the components resulted
in a set of scaffolds totaling 2.906 Gbp in
span and consisting of 2.654 Gbp of
sequence. The chaff, or set of reads not
incorporated into the assembly, numbered 6.17
million, or 22%. More than 90.0% of the
genome was covered by scaffolds spanning
>100 kbp long, and these averaged 92.2%
sequence and 7.8% gaps with a total of 2.492
Gbp of sequence. There were a total of
105,264 gaps among the 107,199 contigs that
belong to the 1940 scaffolds spanning >100
kbp. The average scaffold size was 1.4 Mbp,
the average contig size was 23.24 kbp, and
the average gap size was 2.0 kbp, where each
distribution of sizes was exponential. As
such, averages tend to be underrepresentative
of the majority of the data. Figure 5 shows a
histogram of the bases in scaffolds of various
size ranges. Consider also that more than
49% of all gaps were <500 bp long, more
than 62% of all gaps were <1 kbp long, and all
gaps are <100 kbp long. Similarly, more than
73% of the sequence is in contigs >30 kbp,
more than 49% is in contigs >100 kbp, and
the largest contig was 1.99 Mbp long. Table 3
provides summary statistics for the structure
of this assembly with a direct comparison to
the WGA assembly.



2.5 Comparison of the WGA and CSA
scaffolds

Having obtained two assemblies of the
human genome via independent computational
processes (WGA and CSA), we compared
scaffolds from the two assemblies as another
means of investigating [...]. [...] at least 1000
fragments (Celera sequencing reads or bactig
shreds) was obtained; this amounted to 2218
WGA scaffolds and [...] CSA scaffolds, for a
total of 2.087 Gbp and 2.474 Gbp. The
sequence of each reference scaffold was
compared to the sequence of scaffolds from the
other assembly with which it shared at least 20
fragments or at least 20% of the fragments of
the [...]. For each such comparison, matches of
at least 200 bp with at most 2% mismatch were
tabulated.
From this tabulation, we estimated the
number of bases of each assembly that were
not covered by a matching segment in the
other assembly. Some 82.5 Mbp of the WGA
(3.95%) was not covered by the CSA,
whereas 204.5 Mbp (8.26%) of the CSA was not
covered by the WGA. This estimate did not
require any consistency of the assemblies or
any uniqueness of the matching segments.
Thus, another analysis was conducted in
which matches of less than 1 kbp between a
pair of scaffolds were excluded unless they
were confirmed by other matches having a
consistent order and orientation. This gives
some measure of consistent coverage: 1.982
Gbp (95.00%) of the WGA is covered by the
CSA, and 2.169 Gbp (87.69%) of the CSA is
covered by the WGA by this more stringent
measure.
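
The percentages above imply the size of the reference scaffold sets
they were computed against, which gives a quick internal consistency
check. The sketch below back-calculates those totals from each
(amount, percentage) pair quoted in the text; the function name
implied_total_gbp is hypothetical.

```python
def implied_total_gbp(amount_gbp: float, percent: float) -> float:
    """Total size implied by an amount and the percentage it represents."""
    return amount_gbp / (percent / 100.0)


# WGA reference scaffolds: uncovered and consistently covered amounts.
wga_from_uncovered = implied_total_gbp(0.0825, 3.95)
wga_from_covered = implied_total_gbp(1.982, 95.00)

# CSA reference scaffolds: the same two estimates.
csa_from_uncovered = implied_total_gbp(0.2045, 8.26)
csa_from_covered = implied_total_gbp(2.169, 87.69)

print(f"WGA total: {wga_from_uncovered:.3f} vs {wga_from_covered:.3f} Gbp")
print(f"CSA total: {csa_from_uncovered:.3f} vs {csa_from_covered:.3f} Gbp")
```

Both pairs of estimates agree to within a fraction of a percent, and
the CSA figure is consistent with the 2.474 Gbp total quoted for the
CSA reference scaffolds.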

The comparison of WGA to CSA also
permitted evaluation of scaffolds for
structural inconsistencies. We looked for instances in
which a large section of a scaffold from one
assembly matched only one scaffold from the
other assembly, but failed to match over the
full length of the overlap implied by the
matching segments. An initial set of
candidates was identified automatically, and then
each candidate was inspected by hand. From
this process, we identified 31 instances in
which the assemblies appear to disagree in a
nonlocal fashion. These cases are being
further evaluated to determine which assembly
is in error and why.

In addition, we evaluated local
inconsistencies of order or orientation. The following
results exclude cases in which one contig in
one assembly corresponds to more than one
overlapping contig in the other assembly (as
long as the order and orientation of the latter
agrees with the positions they match in the
former). Most of these small rearrangements
involved segments on the order of hundreds
of base pairs and rarely >1 kbp. We found a
total of 295 kbp (0.012%) in the CSA
assemblies that were locally inconsistent with the
WGA assemblies, whereas 2.108 Mbp
(0.11%) in the WGA assembly were
inconsistent with the CSA assembly.

The CSA assembly was a few percentage
points better in terms of coverage and slightly
more consistent than the WGA, because it
was in effect performing a few thousand
shotgun assemblies of megabase-sized problems,
whereas the WGA is performing a shotgun
assembly of a gigabase-sized problem. When
one considers the increase of two-and-a-half
orders of magnitude in problem size, the
information loss between the two is remarkably
small. Because CSA was logistically easier to
deliver and the better of the two results
available at the time when downstream analyses
needed to be begun, all subsequent analysis
was performed on this assembly.

2.6 Mapping scaffolds to the genome 
The final step in assembling the genome was to 
order and orient the scaffolds on the chromo- 
somes. We first grouped scaffolds together on 
the basis of their order in the components from 
CSA. These grouped scaffolds were reordered 
by examining residual mate-pairing data
between the scaffolds. We next mapped the
scaffold groups onto the chromosome using
physical mapping data. This step depends on
having reliable high-resolution map
information such that each scaffold will
overlap multiple markers. There are two
genome-wide types of map information
available: high-density STS maps and
fingerprint maps of BAC clones developed at
Washington University (^i). Among the
genome-wide STS maps, GeneMap99 (GM99)
has the most markers and therefore was most
useful for mapping scaffolds. The two different
mapping approaches are complementary to one
another. The fingerprint maps should have
better local order because they were built by
comparison of overlapping BAC clones. On the
other hand, GM99 should have a more reliable
long-range order, because the framework
markers were derived from well-validated
genetic maps. Both types of maps were used as
a reference for human curation of the
components that were the input to the regional
assembly, but they did not determine the order
of sequences produced by the assembler.




Fig. 5. Distribution of scaffold sizes of the
CSA. For each range of scaffold sizes, the
percent of total sequence is indicated.
(x-axis: scaffold size, in bins <30 kb, 30-50 kb,
50-100 kb, 100-500 kb, 0.5-1 Mb, 1-5 Mb,
5-10 Mb, >10 Mb.)



In order to determine the effectiveness of
the fingerprint maps and GM99 for mapping
scaffolds, we first examined the reliability of
these maps by comparison with large
scaffolds. Only 1% of the STS markers on the
10 largest scaffolds (those >9 Mbp) were
mapped on a different chromosome on
GM99. Two percent of the STS markers
disagreed in position by more than five
framework bins. However, for the fingerprint
maps, a 2% chromosome discrepancy was
observed, and on average 23.8% of BAC
locations in the scaffold sequence disagreed
with fingerprint map placement by more than
five BACs. When further examining the
source of discrepancy, it was found that most
of the discrepancy came from 4 of the 10
scaffolds, indicating that there is variation in
the quality of either the map or the scaffolds.
All four scaffolds were assembled as well as
the other six, as judged by clone coverage
analysis, and showed the same low
discrepancy rate to GM99, and thus we
concluded that the fingerprint map global
order in these cases was not reliable. Smaller
scaffolds had a higher discordance rate with
GM99 (4.21% of STSs were discordant by
more than five framework bins), but a lower
discordance rate with the fingerprint maps
(11% of BACs disagreed with fingerprint maps
by more than five BACs). This observation
agrees with the clone coverage analysis (46)
that Celera scaffold construction was better
supported by long-range mate pairs in larger
scaffolds than in small scaffolds.

We created two orderings of Celera
scaffolds on the basis of the markers (BAC or
STS) on these maps. Where the order of
scaffolds agreed between GM99 and the
WashU BAC map, we had a high degree of
confidence that that order was correct; these
scaffolds were termed "anchor scaffolds."
Only scaffolds with a low overall discrepancy
rate with both maps were considered anchor
scaffolds. Scaffolds in GM99 bins were
allowed to permute in their order to match
WashU ordering, provided they did not
violate their framework orders. Orientation of
individual scaffolds was determined by the
presence of multiple mapped markers with
consistent order. Scaffolds with only one
marker have insufficient information to
assign orientation. We found 70.1% of the
genome in anchored scaffolds, more than 99%
of which are also oriented (Table 4). Because
GM99 is of lower resolution than the WashU
map, a number of scaffolds without STS
matches could be ordered relative to the
anchored scaffolds because they [...]
sequence from the same or [...] the WashU
map. On the other hand, because of
occasional WashU global ordering
discrepancies, a number of scaffolds
determined to be "unmappable" on the WashU
map could be ordered relative to the anchored
scaffolds with GM99. These scaffolds were
termed "ordered scaffolds." We found that
13.9% of the assembly could be ordered by
these additional methods, and thus 84.0% of
the genome was ordered unambiguously.

Next, all scaffolds that could be placed,
but not ordered, between anchors were
assigned to the interval between the anchored
scaffolds and were deemed to be "bounded"
between them. For example, small scaffolds
having STS hits from the same GeneMap bin
or hitting the same BAC cannot be ordered
relative to each other, but can be assigned a
placement boundary relative to other
anchored or ordered scaffolds. The remaining
scaffolds either had no localization
information, conflicting information, or could
only be assigned to a generic chromosome
location. Using the above approaches, ~98%
of the genome was anchored, ordered, or
bounded.

Finally, we assigned a location for each
scaffold placed on the chromosome by
spreading out the scaffolds per chromosome.
We assumed that the remaining unmapped
scaffolds, constituting 2% of the genome,
were distributed evenly across the genome.
By dividing the sum of unmapped scaffold
lengths by the total number of mapped
scaffolds, we arrived at an estimate of
interscaffold gap of 1483 bp. This gap was
used to separate all the scaffolds on each
chromosome and to assign an offset in the
chromosome.
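
The spacing rule above can be illustrated with a short Python sketch. It is an illustration only and not the procedure used for the assembly: the function names and the toy inputs are hypothetical, and it assumes the scaffolds of a chromosome are already in their final order.

```python
def interscaffold_gap(unmapped_lengths, n_mapped):
    """Uniform gap estimate: total unmapped bases spread over the mapped scaffolds."""
    return round(sum(unmapped_lengths) / n_mapped)


def assign_offsets(scaffold_lengths, gap):
    """Lay ordered scaffolds end to end with a fixed gap; return their start offsets."""
    offsets = []
    position = 0
    for length in scaffold_lengths:
        offsets.append(position)
        position += length + gap
    return offsets


# Toy numbers (hypothetical, not taken from Table 4).
gap = interscaffold_gap([1000, 2000, 3000], n_mapped=4)
offsets = assign_offsets([100, 200, 50], gap=10)
```

With real data, the first function would take the lengths of all unmapped scaffolds and the count of anchored, ordered, and bounded scaffolds.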

During the scaffold-mapping effort, we
encountered many problems that resulted in
additional quality assessment and validation
analysis. At least 978 (3% of 33,173) BACs
were believed to have sequence data from more
than one location in the genome (47). This is
consistent with the bactig chimerism analysis
reported above in the Assembly Strategies
section. These BACs could not be assigned to
unique positions within the CSA assembly and
thus could not be used for ordering scaffolds.
Likewise, it was not always possible to assign
STSs to unique locations in the assembly
because of genome duplications, repetitive
elements, and pseudogenes.

Because of the time required for an
exhaustive search for a perfect overlap, CSA
generated 21,607 intrascaffold gaps where
the mate-pair data suggested that the contigs
should overlap, but no overlap was found.
These gaps were defined as a fixed 50 bp in
length and make up 18.6% of the total
116,442 gaps in the CSA assembly.

We chose not to use the order of exons
implied in cDNA or EST data as a way of
ordering scaffolds. The rationale for not
using this data was that doing so would have
biased certain regions of the assembly by
rearranging scaffolds to fit the transcript data
and made validation of both the assembly and
gene definition processes more difficult.

2.7 Assembly and validation analysis
We analyzed the assembly of the genome
from the perspectives of completeness
(amount of coverage of the genome) and
correctness (the structural accuracy of the
order and orientation and the consensus
sequence of the assembly).

Completeness. Completeness is defined as
the percentage of the euchromatic sequence
represented in the assembly. This cannot be
known with absolute certainty until the
euchromatin sequence has been completed.
However, it is possible to estimate
completeness on the basis of (i) the estimated
sizes of intrascaffold gaps; (ii) coverage of the
two published chromosomes, 21 and 22
(48, 49); and (iii) analysis of the percentage of
an independent set of random sequences (STS
markers) contained in the assembly. The
whole-genome libraries contain
heterochromatic sequence and, although no
attempt has been made to assemble it, there
may be instances of unique sequence
embedded in regions of heterochromatin as
were observed in Drosophila (50, 51).

The sequences of human chromosomes 21
and 22 have been completed to high quality
and published (48, 49). Although this
sequence served as input to the assembler, the
finished sequence was shredded into a
shotgun data set so that the assembler had the
opportunity to assemble it differently from
the original sequence in the case of structural
polymorphisms or assembly errors in the
BAC data. In particular, the assembler must
be able to resolve repetitive elements at the
scale of components (generally
multimegabase in size), and so this
comparison reveals the level to which the
assembler resolves repeats. In certain areas,
the assembly structure differs from the
published versions of chromosomes 21 and 22
(see below). The consequence of the
flexibility to assemble "finished" sequence
differently on the basis of Celera data resulted
in an assembly with more segments than the
chromosome 21 and 22 sequences. We
examined the reasons why there are more gaps
in the Celera sequence than in chromosomes
21 and 22 and expect that they may be typical
of gaps in other regions of the genome. In the
Celera assembly, there are 25 scaffolds, each
containing at least 10 kb of sequence, that
collectively span 94.3% of chromosome 21.
Sixty-two scaffolds span 95.7% of
chromosome 22. The total length of the gaps
remaining in the Celera assembly for these
two chromosomes is 3.4 Mbp. These gap
sequences were analyzed by RepeatMasker
and by searching against the entire genome
assembly (52). About 50% of the gap
sequence consisted of common repetitive
elements identified by RepeatMasker; more
than half of the remainder was lower copy
number repeat elements.
A more global way of assessing
completeness is to measure the content of an
independent set of sequence data in the
assembly. We compared 48,938 STS markers
from Genemap99 (51) to the scaffolds. Because
these markers were not used in the assembly
processes, they provided a truly independent
measure of completeness. ePCR (53) and
BLAST (54) were used to locate STSs on the
assembled genome. We found 44,524 (91%) of
the STSs in the mapped genome. An
additional 2648 markers (5.4%) were found by
searching the unassembled data or "chaff."
We identified 1283 STS markers (2.6%) not
found in either Celera sequence or BAC data
as of September 2000, raising the possibility
that these markers may not be of human
origin. If that were the case, the Celera
assembled sequence would represent 93.4% of
the human genome and the unassembled data
5.5%, for a total of 98.9% coverage. Similarly,
we compared CSA against 36,678 TNG
radiation hybrid markers (55a) using the same
method. We found that 32,371 markers (88%)
were located in the mapped CSA scaffolds,
with 2055 markers (5.6%) found in the
remainder. This gave a 94% coverage of the
genome through another genome-wide survey.
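
The marker-based percentages above follow from simple ratios, sketched here in Python. The function name and the `excluded` parameter are hypothetical conveniences; only the counts are taken from the text. Recomputed values agree with the quoted ones to within about 0.1 percentage point, which is consistent with rounding.

```python
def completeness(total, mapped, chaff, excluded=0):
    """Percent of markers found in the mapped assembly and in the unassembled data.

    `excluded` markers (for example, those suspected not to be of human
    origin) are removed from the denominator.
    """
    denominator = total - excluded
    mapped_pct = 100.0 * mapped / denominator
    chaff_pct = 100.0 * chaff / denominator
    return mapped_pct, chaff_pct, mapped_pct + chaff_pct


# GeneMap99 STS counts quoted in the text.
all_markers = completeness(48_938, 44_524, 2_648)
human_only = completeness(48_938, 44_524, 2_648, excluded=1_283)

# TNG radiation hybrid marker counts quoted in the text.
tng = completeness(36_678, 32_371, 2_055)
```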

Correctness. Correctness is defined as the
structural and sequence accuracy of the
assembly. Because the source sequences for the
Celera data and the GenBank data are from
different individuals, we could not directly
compare the consensus sequence of the
assembly against other finished sequence for
determining sequencing accuracy at the
nucleotide level, although this has been done
for identifying polymorphisms as described in
Section 6. The accuracy of the consensus
sequence is at least 99.96% on the basis of a
statistical estimate derived from the quality
values of the underlying reads.

Table 4. Summary of scaffold mapping.
Scaffolds were mapped to the genome with
different levels of confidence (anchored
scaffolds have the highest confidence;
unmapped scaffolds have the lowest).
Anchored scaffolds were consistently ordered
by the WashU BAC map and GM99. Ordered
scaffolds were consistently ordered by at least
one of the following: the WashU BAC map,
GM99, or component tiling path. Bounded
scaffolds had order conflicts between at least
two of the external maps, but their placements
were adjacent to a neighboring anchored or
ordered scaffold. Unmapped scaffolds had, at
most, a chromosome assignment. The scaffold
subcategories are given below each category.

Mapped scaffold category    Number      Length (bp)   % Total length
Anchored                     1,526    1,860,676,676    70
  Oriented                   1,246    1,852,088,645    70
  Unoriented                   280        8,588,031     0.3
Ordered                      2,001      369,235,857    14
  Oriented                     839      329,633,166    12
  Unoriented                 1,162       39,602,691     2
Bounded                     38,241      368,753,463    14
  Oriented                   7,453      274,536,424    10
  Unoriented                30,788       94,217,039     4
Unmapped                    11,823       55,313,737
  Known chromosome             281        2,505,844     0.1
  Unknown chromosome        11,542       52,807,893     2

The structural consistency of the assembly
can be measured by mate-pair analysis. In a
correct assembly, every mated pair of
sequencing reads should be located on the
consensus sequence with the correct
separation and orientation between the pairs.
A pair is termed "valid" when the reads are in
the correct orientation and the distance
between them is within the mean ± 3 standard
deviations of the distribution of insert sizes of
the library from which the pair was sampled.
A pair is termed "misoriented" when the reads
are not correctly oriented, and is termed
"misseparated" when the distance between the
reads is not in the correct range but the reads
are correctly oriented. The mean ± the
standard deviation of each library used by the
assembler was determined as described
above. To validate these, we examined all
reads mapped to the finished sequence of
chromosome 21 (48) and determined how
many incorrect mate pairs there were as a
result of laboratory tracking errors and
chimerism (two different segments of the
genome cloned into the same plasmid), and
how tight the distribution of insert sizes was
for those that were correct (Table 5). The
standard deviations for all Celera libraries
were quite small, less than 15% of the insert
length, with the exception of a few 50-kbp
libraries. The 2- and 10-kbp libraries
contained less than 2% invalid mate pairs,
whereas the 50-kbp libraries were somewhat
higher (~10%). Thus, although the mate-pair
information was not perfect, its accuracy was
such that measuring valid, misoriented, and
misseparated pairs with respect to a given
assembly was deemed to be a reliable
instrument for validation purposes, especially
when several mate pairs confirm or deny an
ordering. The clone coverage of the genome
was 39X, meaning that any given base pair
was, on average, contained in 39 clones or,
equivalently, spanned by 39 mate-paired
reads. Areas of low clone coverage or areas
with a high proportion of invalid mate pairs
would indicate potential assembly problems.
We computed the coverage of each base in the
assembly by valid mate pairs (Table 6). In
summary, for scaffolds >30 kbp in length,
less than 1% of the Celera assembly was in
regions of less than 3X clone coverage. Thus,
more than 99% of the assembly, including
order and orientation, is strongly supported
by this measure alone.
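
The mate-pair categories and the per-base clone coverage described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the validation code used for the assembly: the function names, the half-open span convention, and the toy inputs are all hypothetical.

```python
def classify_mate_pair(distance, correctly_oriented, mean, sd, k=3.0):
    """Label one mate pair using the library's insert-size mean and SD."""
    if not correctly_oriented:
        return "misoriented"
    if abs(distance - mean) <= k * sd:
        return "valid"
    return "misseparated"


def clone_coverage(length, valid_spans):
    """Per-base count of valid mate-pair spans, given half-open (start, end) spans."""
    diff = [0] * (length + 1)
    for start, end in valid_spans:
        diff[start] += 1  # span begins covering here
        diff[end] -= 1    # span stops covering here
    coverage = []
    running = 0
    for delta in diff[:length]:
        running += delta
        coverage.append(running)
    return coverage


# Toy library with mean insert 2000 bp and SD 100 bp (hypothetical values).
labels = [
    classify_mate_pair(2100, True, mean=2000, sd=100),
    classify_mate_pair(2500, True, mean=2000, sd=100),
    classify_mate_pair(2000, False, mean=2000, sd=100),
]
coverage = clone_coverage(6, [(0, 4), (2, 6)])
```

Bases whose coverage falls below a chosen threshold (the text uses 3X) would then be flagged as weakly supported.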

We examined the locations and number of
all misoriented and misseparated mates. In
addition to doing this analysis on the CSA
assembly (as of 1 October 2000), we also
performed a study of the PFP assembly as of
5 September 2000 (30, 55b). In this latter
case, Celera mate pairs had to be mapped to
the PFP assembly. To avoid mapping errors
due to high-fidelity repeats, the only pairs
mapped were those for which both reads
matched at only one location with less than
6% differences. A threshold was set such that
sets of five or more simultaneously invalid
mate pairs indicated a potential breakpoint,
where the construction of the two assemblies
differed. The graphic comparison of the CSA
chromosome 21 assembly with the published
sequence (Fig. 6A) serves as a validation of
this methodology. Blue tick marks in the
panels indicate breakpoints. There were a
similar (small) number of breakpoints on
both chromosome sequences. The exception
was 12 sets of scaffolds in the Celera
assembly (a total of 3% of the chromosome
length in 212 single-contig scaffolds) that
were mapped to the wrong positions because
they were too small to be mapped reliably.
Figures 6 and 7 and Table 6 illustrate the
mate-pair differences and breakpoints
between the two assemblies. There was a
higher percentage of misoriented and
misseparated mate pairs in the large-insert
libraries (50 kbp and BAC ends) than in the
small-insert libraries in both assemblies
(Table 6). The large-insert libraries are more
likely to identify discrepancies simply because
they span a larger segment of the genome. The
graphic comparison between the two
assemblies for chromosome 8 (Fig. 6, B and C)
shows that there are many



Table 5. Mate-pair validation. Celera fragment sequences were mapped to
the published sequence of chromosome 21. Each mate pair uniquely
mapped was evaluated for correct orientation and placement (number
of mate pairs tested). If the two mates had incorrect relative orientation
or placement, they were considered invalid (number of invalid mate
pairs). A "?" marks a digit or value that is unreadable in this copy.

                  |-------------------- Chromosome 21 ---------------------|  |-------- Genome ---------|
Library  Library  Mean insert  SD       SD/mean  No. pairs  No.      %        Mean insert  SD       SD/mean
type     no.      size (bp)    (bp)     (%)      tested     invalid  invalid  size (bp)    (bp)     (%)
2 kbp     1         2,0?1         106    5.1       3,642       38     1.0       2,082          90    4.3
          2         1,913         152    7.9      28,029      413     1.5       1,923         118    6.1
          3         2,166         175    8.1       4,405       57     1.3       2,162         158    7.3
10 kbp    4        11,385         851    7.5       4,319       80     1.9      11,370         696    6.1
          5        14,5?3       1,875   12.9       7,355      156     2.1      14,142       1,402    9.9
          6         9,635       1,035   10.7       5,573      109     2.0       9,606         934    9.7
          7        10,223         928    9.1      34,079      399     1.2      10,190         777    7.6
50 kbp    8        64,888       2,747    4.2          16        1     6.3      65,500       5,504    8.4
          9        53,410       5,834   10.9         914      170    18.6      53,311       5,546   10.4
         10        52,034       7,312   14.1       5,871      569     9.7      51,498       6,588   12.8
         11        52,282       7,454   14.3       2,629      213     8.1      52,282       7,454   14.3
         12        46,616           ?   15.8       2,153      215    10.0      45,418       9,068   20.0
         13        55,788      10,099   18.1       2,244      249    11.1      53,062      10,893   20.5
         14        39,894       5,019   12.6         199        7     3.5      36,838       9,988   27.1
         15        48,931       9,813   20.1         144       10     6.9      47,845       4,774   10.0
         16        48,130       4,232    8.8         195       14     7.2      47,924       4,581    9.6
BES      17       106,027      27,778   26.2         330       16     4.8     152,000      26,600   17.5
         18       160,575      54,973   34.2         155        8     5.2     161,750      27,000   16.7
         19       164,155      19,453   11.9         642       44     6.9     176,500      19,500   11.05
Sum                                               102,894    2,768     2.7 (mean)
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gene boundaries. During this process, multiple 
hits to the same region were collapsed to a
coherent set of data by tracking the coverage
of a region. For example, if a group of bases
was represented by multiple overlapping
ESTs, the union of these regions matched by
the set of ESTs on the scaffold was marked as
being supported by EST evidence. This
resulted in a series of "gene bins," each of
which was believed to contain a single gene.
One weakness of this initial implementation of
the algorithm was in predicting gene
boundaries in regions of tandemly duplicated
genes. Gene clusters frequently resulted in
homologous neighboring genes being joined
together, resulting in an annotation that
artificially concatenated these gene models.
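
Collapsing overlapping hits into the union of the regions they cover is a standard interval merge, sketched below in Python. The sketch assumes half-open (start, end) coordinates on one scaffold; the function name and the example coordinates are hypothetical, and the sketch does not attempt the further step of deciding which merged regions form a gene bin.

```python
def union_intervals(intervals):
    """Merge overlapping or touching (start, end) intervals into their union."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous region: extend it if this hit reaches further.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(region) for region in merged]


# Three hypothetical EST hits; the first two overlap.
supported = union_intervals([(40, 90), (10, 50), (200, 260)])
```

The tandem-duplication weakness noted above shows up here as well: hits from neighboring homologous genes that overlap are merged into one region.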

Next, known genes (those with exact
matches of a full-length cDNA sequence to the
genome) were identified, and the region
corresponding to the cDNA was annotated as
a predicted transcript. A subset of the curated
human gene set RefSeq from the National
Center for Biotechnology Information (NCBI)
was included as a data set searched in the
computational pipeline. If a RefSeq transcript
matched the genome assembly for at least
50% of its length at >92% identity, then the
SIM4 (63) alignment of the RefSeq transcript
to the region of the genome under analysis was
promoted to the status of an Otto annotation.
Because the genome sequence has gaps and
sequence errors such as frameshifts, it was not
always possible to predict a transcript that
agrees precisely with the experimentally
determined cDNA sequence. A total of 6538
genes in our inventory were identified and
transcripts predicted in this way.
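
The promotion threshold for RefSeq alignments can be written as a one-line predicate. The Python sketch below is illustrative only: the function and parameter names are hypothetical, and it assumes identity is supplied as a fraction over the aligned portion.

```python
def promote_refseq_alignment(aligned_len, transcript_len, identity):
    """True if at least 50% of the transcript aligns at more than 92% identity."""
    return aligned_len / transcript_len >= 0.50 and identity > 0.92


examples = [
    promote_refseq_alignment(600, 1000, 0.95),  # long enough, similar enough
    promote_refseq_alignment(400, 1000, 0.99),  # too little of the transcript aligned
    promote_refseq_alignment(600, 1000, 0.92),  # identity not above 92%
]
```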

Fig. 6. Comparison of the CSA and the PFP assembly.
(A) All of chromosome 21, (B) all of chromosome 8,
and (C) a 1-Mb region of chromosome 8 representing
a single Celera scaffold. To generate the figure, Celera
fragment sequences were mapped onto each
assembly. The PFP assembly is indicated in the upper
third of each panel; the Celera assembly is indicated in
the lower third. In the center of the panel, green lines
show Celera sequences that are in the same order and
orientation in both assemblies and form the longest
consistently ordered run of sequences. Yellow lines
indicate sequence blocks that are in the same
orientation, but out of order. Red lines indicate
sequence blocks that are not in the same orientation.
For clarity, in the latter two cases, lines are only drawn
between segments of matching sequence that are at
least 50 kbp long. The top and bottom thirds of each
panel show the extent of Celera mate-pair violations
(red, misoriented; yellow, incorrect distance between
the mates) for each assembly grouped by library size.
(Mate pairs that are within the correct distance, as
expected from the mean library insert size, are
omitted from the figure for clarity.) Predicted
breakpoints, corresponding to stacks of violated mate
pairs of the same type, are shown as blue ticks on each
assembly axis. Runs of more than 10,000 Ns are shown
as cyan bars. Plots of all 24 chromosomes can be seen
in Web fig. 3 on Science Online at
www.sciencemag.org/cgi/content/full/291/5507/1304/DC1.

[...] on all chromosomes. For each chromosome, the
upper pair of lines represent the PFP assembly, and
the lower pair of lines represent Celera's [...] is
indicated in black, and the chromosome numbers in
red.

Regions that have a substantial amount of
sequence similarity, but do not match known
genes, were analyzed by that part of the Otto
system that uses the sequence similarity
information to predict a transcript. Here, Otto
evaluates evidence generated by the
computational pipeline, corresponding to
conservation between mouse and human
genomic DNA, similarity to human transcripts
(ESTs and cDNAs), similarity to rodent
transcripts (ESTs and cDNAs), and similarity
of the translation of human genomic DNA to
known proteins to predict potential genes in
the human genome. The sequence from the
region of genomic DNA contained in a gene
bin was extracted, and the subsequences
supported by any homology evidence were
marked (plus 100 bases flanking these
regions). The other bases in the region, those
not covered by any homology evidence, were
replaced by N's. This sequence segment, with
high confidence regions represented by the
consensus genomic sequence and the
remainder represented by N's, was then
evaluated by Genscan to see if a consistent
gene model could be generated. This
procedure simplified the gene-prediction task
by first establishing the boundary for the gene
(not a strength of most gene-finding
algorithms), and by eliminating regions with
no supporting evidence. If Genscan returned a
plausible gene model, it was further evaluated
before being promoted to an "Otto"
annotation. The final Genscan predictions
were often quite different from the prediction
that Genscan returned on the same region of
native genomic sequence. A weakness of using
Genscan to refine the gene model is the loss of
valid, small exons from the final annotation.
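
The evidence-masking step (keep bases supported by homology evidence plus flanking bases, replace everything else with N) can be sketched in Python as below. This is a minimal illustration, assuming half-open (start, end) evidence coordinates within the extracted region; the function name is hypothetical, and the default flank of 100 bases is the value given in the text.

```python
def mask_unsupported(sequence, evidence, flank=100):
    """Replace bases outside the evidence regions (extended by `flank`) with N."""
    keep = [False] * len(sequence)
    for start, end in evidence:
        lo = max(0, start - flank)
        hi = min(len(sequence), end + flank)
        for i in range(lo, hi):
            keep[i] = True
    return "".join(base if kept else "N" for base, kept in zip(sequence, keep))


# Toy region with one evidence hit and a 1-base flank so the effect is visible.
masked = mask_unsupported("ACGTACGTAC", [(4, 6)], flank=1)
```

The masked string is what would then be handed to a gene finder such as Genscan; running the gene finder itself is outside this sketch.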

Table 7. Sensitivity and specificity of Otto and
Genscan. Sensitivity and specificity were calculated
by first aligning the prediction to the published
RefSeq transcript, tallying the number (N) of
uniquely aligned RefSeq bases. Sensitivity is the
ratio of N to the length of the published RefSeq
transcript. Specificity is the ratio of N to the
length of the prediction. All differences are
significant (Tukey HSD; P < 0.001).

Method                  Sensitivity   Specificity
Otto (RefSeq only)*     0.939         0.973
Otto (homology)†        0.604         0.884
Genscan                 0.501         0.633

*Refers to those annotations produced by Otto using only
the Sim4-polished RefSeq alignment rather than an
evidence-based Genscan prediction. †Refers to those
annotations produced by supplying all available evidence
to Genscan.

The next step in defining gene structures
based on sequence similarity was to compare
each predicted transcript with the
homology-based evidence that was used in
previous steps to evaluate the depth of
evidence for each exon in the prediction.
Internal exons were considered to be
supported if they were covered by homology
evidence to within ±10 bases of their edges.
For first and last exons, the internal edge was
required to be within 10 bases, but the
external edge was allowed greater latitude to
allow for 5' and 3' untranslated regions
(UTRs). To be retained, a prediction for a
multi-exon gene must have evidence such that
the total number of "hits," as defined above,
divided by the number of exons in the
prediction must be >0.66 or must correspond
to a RefSeq sequence. A single-exon gene must
be covered by at least three supporting hits
(±10 bases on each side), and these must
cover the complete predicted open reading
frame. For a single-exon gene, we also
required that the Genscan prediction include
both a start and a stop codon. Gene models
that did not meet these criteria were
disregarded, and those that passed were
promoted to Otto predictions. Homology-based
Otto predictions do not contain 3' and 5'
untranslated sequence. Although three de
novo gene-finding programs [GRAIL, Genscan,
and FgenesH (63)] were run as part of the
computational analysis, the results of these
programs were not directly used in making the
Otto predictions. Otto predicted 11,226
additional genes by means of sequence
similarity.
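
The retention criteria for homology-based predictions can be summarized as a small decision function. The Python sketch below is one reading of the rules stated above, not the Otto implementation: the function and parameter names are hypothetical, and it assumes the per-exon support checks (the ±10 base edge test) have already been reduced to a count of hits.

```python
def retain_prediction(n_exons, n_hits, is_refseq=False,
                      hits_cover_orf=False, has_start_and_stop=False):
    """Decide whether a predicted gene model passes the stated evidence criteria."""
    if n_exons > 1:
        # Multi-exon: hits per exon must exceed 0.66, or the model matches RefSeq.
        return is_refseq or (n_hits / n_exons) > 0.66
    # Single-exon: at least three hits covering the whole ORF, plus start and stop.
    return n_hits >= 3 and hits_cover_orf and has_start_and_stop


decisions = [
    retain_prediction(6, 4),                                  # 4/6 is above 0.66
    retain_prediction(3, 1),                                  # 1/3 is too low
    retain_prediction(3, 1, is_refseq=True),                  # rescued by RefSeq
    retain_prediction(1, 3, hits_cover_orf=True, has_start_and_stop=True),
    retain_prediction(1, 2, hits_cover_orf=True, has_start_and_stop=True),
]
```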

3.2 Otto validation 

To validate the Otto homology-based process
and the method that Otto uses to define the
structures of known genes, we compared
transcripts predicted by Otto with their
corresponding (and presumably correct)
transcript from a set of 4512 RefSeq
transcripts for which there was a unique SIM4
alignment (Table 7). In order to evaluate the
relative performance of Otto and Genscan, we
made three comparisons. The first involved a
determination of the accuracy of gene models
predicted by Otto with only homology data
other than the corresponding RefSeq sequence
(Otto homology in Table 7). We measured the
sensitivity (correctly predicted bases divided
by the total length of the cDNA) and
specificity (correctly predicted bases divided
by the sum of the correctly and incorrectly
predicted bases). Second, we examined the
sensitivity and specificity of the Otto
predictions that were made solely with the
RefSeq sequence, which is the process that
Otto uses to annotate known genes
(Otto-RefSeq). And third, we determined the
accuracy of the Genscan predictions
corresponding to these RefSeq sequences. As
expected, the alignment method (Otto-RefSeq)
was the most accurate, and Otto-homology
performed better than Genscan by both
criteria. Thus, 6.1% of true RefSeq
nucleotides were not represented in the
Otto-RefSeq annotations and 2.7% of the
nucleotides in the Otto-RefSeq transcripts
were not contained in the original RefSeq
transcripts. The discrepancies could come
from legitimate differences between the Celera
assembly and the RefSeq transcript due to
polymorphisms, incomplete or incorrect data
in the Celera assembly, errors introduced by
Sim4 during the alignment process, or the
presence of alternatively spliced forms in the
data set used for the comparisons.
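
The two accuracy measures defined above reduce to a pair of ratios over base counts. The Python sketch below uses hypothetical names and toy numbers; it assumes the count of correctly predicted (uniquely aligned) bases has already been obtained from an alignment.

```python
def sensitivity_specificity(n_correct, refseq_length, prediction_length):
    """Sensitivity = correct bases / RefSeq length; specificity = correct bases / prediction length."""
    sensitivity = n_correct / refseq_length
    specificity = n_correct / prediction_length
    return sensitivity, specificity


# Toy transcript: 900 correct bases, 1000-base RefSeq, 1200-base prediction.
sens, spec = sensitivity_specificity(900, 1000, 1200)
```

Read this way, the Otto-RefSeq row of Table 7 (0.939 and 0.973) corresponds to the 6.1% of RefSeq bases missed and the 2.7% of predicted bases not in RefSeq quoted in the text.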

Because Otto uses an evidence-based
approach to reconstruct genes, the absence of
experimental evidence for intervening exons
may inadvertently result in a set of exons that
cannot be spliced together to give rise to a
transcript. In such cases, Otto may "split
genes" when in fact all the evidence should be
combined into a single transcript. We also
examined the tendency of these methods to
incorrectly split gene predictions. These
trends are shown in Fig. 8. Both RefSeq and
homology-based predictions by Otto split
known genes into fewer segments than
Genscan alone.



3.3 Gene number 

Recognizing that the Otto system is quite
conservative, we used a different
gene-prediction strategy in regions where the
homology evidence was less strong. Here the
results of de novo gene predictions were
used. For these genes, we insisted that a
predicted transcript have at least two of the
following types of evidence to be included
in the gene set for further analysis: protein,
human EST, rodent EST, or mouse genome
fragment matches. This final class of
predicted genes is a subset of the predictions
made by the three gene-finding programs
that were used in the computational
pipeline. For these, there was not sufficient
sequence similarity information for Otto to
attempt to predict a gene structure. The
three de novo gene-finding programs
resulted in about 155,695 predictions, of
which ~76,410 were nonredundant
(nonoverlapping with one another). Of these,
57,935 did not overlap known genes or
predictions made by Otto. Only 21,350 of
the gene predictions that did not overlap
Otto predictions were partially supported
by at least one type of sequence similarity
evidence, and 8619 were partially supported
by two types of evidence (Table 8).

The sum of this number (21,350) and the
number of Otto annotations (17,764), 39,114,
is near the upper limit for the human gene
complement. As seen in Table 8, if the
requirement for other supporting evidence is
made more stringent, this number drops
rapidly so that demanding two types of
evidence reduces the total gene number to
26,383 and demanding three types reduces it
to ~23,000. Requiring that a prediction be
supported by all four categories of evidence is
too stringent because it would eliminate genes
that encode novel proteins (members of
currently undescribed protein families). No
correction for pseudogenes has been made at
this point in the analysis.
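
The gene totals quoted above are sums of the Otto annotation count and the supported de novo predictions, which a few lines of Python make explicit. The variable and function names are hypothetical; the counts are the ones given in the text, and the three-evidence total is omitted because the text gives only the rounded result.

```python
OTTO_ANNOTATIONS = 17_764

# De novo predictions not overlapping Otto, keyed by minimum number of evidence types.
DE_NOVO_SUPPORTED = {1: 21_350, 2: 8_619}


def gene_count(min_evidence_types):
    """Otto annotations plus de novo predictions meeting the evidence requirement."""
    return OTTO_ANNOTATIONS + DE_NOVO_SUPPORTED[min_evidence_types]


upper_estimate = gene_count(1)
two_evidence_estimate = gene_count(2)
```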

In a further attempt to identify genes that were not found by the
autoannotation process or any of the de novo gene finders, we examined
regions outside of gene predictions that were similar to the EST
sequence, and where the EST matched the genomic sequence across a
splice junction. After correcting for potential 3' UTRs of predicted
genes, about 2500 such regions remained. Addition of a requirement for
at least one of the following evidence types (homology to mouse
genomic sequence fragments, rodent ESTs, or cDNAs) or similarity to a
known protein reduced this number to 1010. Adding this to the numbers
from the previous paragraph would give us estimates of about 40,000,
27,000, and 24,000 potential genes in the human genome, depending on
the stringency of evidence considered. Table 8 illustrates the
number of genes and presents the degree of
port the Otto and other predicted transcripts. For example, one can
see that a typical Otto transcript has 6.99 of its 7.81 exons
supported by protein homology evidence. As would be expected, the Otto
transcripts generally have more support than do transcripts predicted
by the de novo methods.

4 Genome Structure

Summary. This section describes several of the noncoding attributes
of the assembled genome sequence and their correlations with the
predicted gene set. These include an analysis of G+C content and gene
density in the context of cytogenetic maps of the genome, an
enumerative analysis of CpG islands, and a brief description of the
genome-wide repetitive elements.



4.1 Cytogenetic maps
Perhaps the most obvious, and certainly the most visible, element of
the structure of the genome is the banding pattern produced by Giemsa
stain. Chromosomal banding studies have revealed that about 17% to
20% of the human chromosome complement consists of C-bands, or
constitutive heterochromatin (64). Much of this heterochromatin is
highly polymorphic and consists of different families of alpha
satellite DNAs with various higher order repeat structures (65). Many
chromosomes have complex inter- and intrachromosomal duplications
present in pericentromeric regions (66). About 5% of the sequence
reads were identified as alpha satellite sequences; these were not
included in the assembly.
confidence based on the supporting evidence. Transcripts encoded by a
set of 26,383 genes were assembled for further analysis. This set
includes the 6538 genes predicted by Otto on the basis of matches to
known genes, 11,226 transcripts predicted by Otto based on homology
evidence, and 8619 from the subset of transcripts from de novo
gene-prediction programs that have two types of supporting evidence.
The 26,383 genes are illustrated along chromosome diagrams in Fig. 1.
These are a very preliminary set of annotations and are subject to all
the limitations of an automated process. Considerable refinement is
still necessary to improve the accuracy of these transcript
predictions. All the predictions and descriptions of genes and the
associated evidence that we present are the product of completely
computational processes, not expert curation. We have attempted to
enumerate the genes in the human genome in such a way that we have
different levels of confidence based on the amount of supporting
evidence: known genes, genes with good protein or EST homology
evidence, and de novo gene predictions confirmed by modest homology
evidence.

3.4 Features of human gene transcripts

We estimate the average span for a "typical" gene in the human DNA
sequence to be about 27,894 bases. This is based on the average span
covered by RefSeq transcripts, used because it represents our highest
confidence set.

The set of transcripts promoted to gene annotations varies in a number
of ways. As can be seen from Table 8 and Fig. 9, transcripts predicted
by Otto tend to be longer, having on average about 7.8 exons, whereas
those promoted from gene-prediction programs average about 3.7 exons.
The largest number of exons that we have identified in a transcript is
234 in the titin mRNA. Table 8 compares the amounts of evidence that
sup-
Table 8. Number of exons and transcripts supported by various types of
evidence for Otto and de novo gene prediction methods. Highlighted
cells indicate the gene sets analyzed in this paper (boldface, set of
genes selected for protein analysis; italic, total set of accepted de
novo predictions).
[Fig. 8 x-axis: Number of predictions per RefSeq transcript]
Fig. 8. Analysis of split genes resulting from different annotation
methods. A set of 4512 Sim4-based alignments of RefSeq transcripts to
the genomic assembly were chosen (see the text for criteria), and the
numbers of overlapping Genscan, Otto (RefSeq only) annotations based
solely on Sim4-polished RefSeq alignments, and Otto (homology)
annotations (annotations produced by supplying all available evidence
to Genscan) were tallied. These data show the degree to which multiple
Genscan predictions and/or Otto annotations were associated with a
single RefSeq transcript. The zero class for the Otto-homology
predictions shown here indicates that the Otto-homology calls were
made without recourse to the RefSeq transcript, and thus no Otto call
was made because of insufficient evidence.



                                        Types of evidence                    No. of lines of evidence*
                           Total    Mouse   Rodent  Protein    Human       ≥1       ≥2       ≥3       ≥4
Otto
  Number of transcripts   17,969   17,065   14,881   15,477   16,374  17,968†   17,501   15,877   12,451
  Number of exons        141,218  111,174   89,569  108,431  118,869  140,710  127,955   99,574   59,804
De novo
  Number of transcripts   58,032   14,463    5,094    8,043    9,220   21,350    8,619    4,947    1,904
  Number of exons        319,935   48,594   19,344   26,264   40,104   79,148   31,130   17,508    6,520
No. of exons per transcript
  Otto                      7.84     5.77     6.01     6.99     7.24     7.81     7.19     6.00     4.28
  De novo                   5.53     3.17     3.80     3.27     4.36     3.7      3.56     3.42     3.16

*[...] cDNA, and similarity to known proteins) were [...]
†This number includes alternative splice forms of the 17,764 genes mentioned elsewhere in the text.
Examination of pericentromeric regions is ongoing.

The remaining ~80% of the genome, the euchromatic component, is
divisible into G-, R-, and T-bands (67). These cytogenetic bands have
been presumed to differ in their nucleotide composition and gene
density, although we have been unable to determine precise band
boundaries at the molecular level. T-bands are the most G+C- and
gene-rich, and G-bands are G+C-poor (68). Bernardi has also offered a
description of the euchromatin at the molecular level as long
stretches of DNA of differing base composition, termed isochores
(denoted L, H1, H2, and H3), which are >300 kbp in length (69).
Bernardi defined the L (light) isochores as G+C-poor (<43%), whereas
the H (heavy) isochores fall into three G+C-rich classes representing
24, 8, and 5% of the genome. Gene concentration has been claimed to be
very low in the L isochores and 20-fold more enriched in the H2 and H3
isochores (70). By examining contiguous 50-kbp windows of G+C content
across the assembly, we found that regions of G+C content >48% (H3
isochores) averaged 273.9 kbp in length, those with G+C content
between 43 and 48% (H1+H2 isochores) averaged 202.8 kbp in length, and
the average span of regions with <43% (L isochores) was 1078.6 kbp.
The correlation between G+C content and gene density was also examined
in 50-kbp windows along the assembled sequence (Table 9 and Figs. 10
and 11). We found that the density of genes was greater in regions of
high G+C than in regions of low G+C content, as expected. However, the
correlation between G+C content and gene density was not as skewed as
previously predicted (69). A higher proportion of genes were located
in the G+C-poor regions than had been expected.
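The window-based measurement described above can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions, not the code used for the analysis: windows are non-overlapping, undetermined bases are excluded from the G+C denominator, a trailing partial window is dropped, and adjacent windows of the same class are merged into one region. The function names (`gc_fraction`, `classify`, `isochore_runs`, `mean_span`) are hypothetical; the class boundaries (43% and 48%) are those given in the text.

```python
from itertools import groupby


def gc_fraction(seq: str) -> float:
    """G+C fraction of a sequence, ignoring bases other than A, C, G, T."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def classify(gc: float) -> str:
    """Assign a window to an isochore class using the thresholds in the text."""
    if gc > 0.48:
        return "H3"
    if gc >= 0.43:
        return "H1/H2"
    return "L"


def isochore_runs(seq: str, window: int = 50_000):
    """Classify contiguous windows and merge neighbors of the same class.

    Returns a list of (class, length_in_bp) tuples in sequence order.
    """
    classes = [
        classify(gc_fraction(seq[i:i + window]))
        for i in range(0, len(seq) - window + 1, window)
    ]
    return [(label, sum(1 for _ in grp) * window) for label, grp in groupby(classes)]


def mean_span(runs):
    """Average region length (bp) for each isochore class."""
    lengths = {}
    for label, length in runs:
        lengths.setdefault(label, []).append(length)
    return {label: sum(v) / len(v) for label, v in lengths.items()}
```

With a toy sequence and a 10-bp window, `isochore_runs("GC" * 10 + "AT" * 20 + "GC" * 10, window=10)` yields two H3 regions of 20 bp separated by one L region of 40 bp.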

Chromosomes 17, 19, and 22, which have a disproportionate number of
H3-containing bands, had the highest gene density (Table 10).
Conversely, of the chromosomes that we found to have the lowest gene
density, X, 4, 18, 13, and Y, also have the fewest H3 bands.
Chromosome 15, which also has few H3 bands, did not have a
particularly low gene density in our analysis. In addition, chromosome
8, which we found to have a low gene density, does not appear to be
unusual in its H3 banding.

How valid is Ohno's postulate (71) that mammalian genomes consist of
oases of genes in otherwise essentially empty deserts? It appears that
the human genome does indeed contain deserts, or large, gene-poor
regions. If we define a desert as a region >500 kbp without a gene,
then we see that 605 Mbp, or about 20% of the genome, is in deserts.
These are not uniformly distributed over the various chromosomes.
Gene-rich chromosomes 17, 19, and 22 have only about 12% of their
collective 171 Mbp in deserts, whereas gene-poor chromosomes 4, 13,
18, and X have 27.5% of their 492 Mbp in deserts (Table 11). The
apparent lack of predicted genes in these regions does not necessarily
imply that they are devoid of biological function.
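The desert definition above (a region >500 kbp without a gene) translates into a simple interval scan. The following Python sketch is illustrative only; the function names and the decision to treat chromosome ends as desert boundaries are our assumptions, and gene coordinates are taken to be half-open (start, end) pairs on a single chromosome.

```python
def gene_deserts(genes, chrom_length, min_size=500_000):
    """Return (start, end) intervals longer than `min_size` that contain no gene.

    `genes` is an iterable of half-open (start, end) coordinates in bp.
    Overlapping or nested genes are handled by tracking the furthest end seen.
    """
    deserts = []
    cursor = 0  # end of the right-most gene seen so far
    for start, end in sorted(genes):
        if start - cursor > min_size:
            deserts.append((cursor, start))
        cursor = max(cursor, end)
    if chrom_length - cursor > min_size:
        deserts.append((cursor, chrom_length))
    return deserts


def desert_fraction(genes, chrom_length, min_size=500_000):
    """Fraction of the chromosome that lies in gene deserts."""
    total = sum(end - start for start, end in gene_deserts(genes, chrom_length, min_size))
    return total / chrom_length
```

For scale, the genome-wide figures quoted here correspond to 605 Mbp of desert out of a 2.91-Gbp genome (Table 11), or roughly 0.21 of the sequence.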

4.2 Linkage map 

Linkage maps provide the basis for genetic analysis and are widely
used in the study of the inheritance of traits and in the positional
cloning of genes. The distance metric, centimorgans (cM), is based on
the recombination rate between homologous chromosomes during meiosis.
In general, the rate of recombination in females is greater than that
in males, and this degree of map expansion is not uniform across the
genome (72). One of the opportunities enabled by a nearly complete
genome sequence is to produce the ultimate physical map, and to fully
analyze its correspondence with two other maps that have been widely
used in genome and genetic analysis: the linkage map and the
cytogenetic map. This would close the loop between the mapping and
sequencing phases of the genome project.

We mapped the location of the markers that constitute the Genethon
linkage map to the genome. The rate of recombination, expressed as cM
per Mbp, was calculated for 3-Mbp windows as shown in Table 12. Higher
rates of recombination in the telomeric region of the chromosomes have
been previously documented (73). From this mapping result, there is a
difference of 4.99 between lowest rates and highest rates and the
largest difference of 4.4 between males and females (4.99 to 0.47 on
chromosome 16). This indicates that the variability in recombination
rates among regions of the genome exceeds the differences in
recombination rates between males and females. The human genome has
recombination hotspots, where recombination rates vary fivefold or
more over a space of 1 kbp, so the picture one gets of the magnitude
of variability in recombination rate will depend on the size of the
window

Table 9. Characteristics of G+C in isochores.



                         Fraction of genome (%)     Fraction of genes (%)
Isochore    G+C (%)      Predicted*   Observed      Predicted*   Observed
H3          >48               5          9.5            37         24.8
H1/H2       43-48            25         21.2            32         26.6
L           <43              67         69.2            31         48.5

*The predictions were based on Bernardi's definitions (70) of the isochore structure of the human genome.



Fig. 9. Comparison of the number of exons per transcript between the
17,968 Otto transcripts and 21,350 de novo transcript predictions with
at least one line of evidence that do not overlap with an Otto
prediction. Both sets have the highest number of transcripts in the
two-exon category, but the de novo gene predictions are skewed much
more toward smaller transcripts. In the Otto set, 19.7% of the
transcripts have one or two exons, and 5.7% have more than 20. In the
de novo set, 49.3% of the transcripts have one or two exons, and 0.2%
have more than 20.
[Fig. 9 legend: No. of Otto transcripts; No. of de novo + 1 line of evidence. x-axis: Number of exons per transcript]
examined. Unfortunately, too few meiotic crossovers have occurred in
Centre d'Etude du Polymorphisme Humain (CEPH) and other reference
families to provide a resolution any finer than about 3 Mbp. The next
challenge will be to determine a sequence basis of recombination at
the chromosomal level. An accurate predictor of the variation in
recombination rates between any pair of markers would be extremely
useful in designing markers to narrow a region of linkage, such as in
positional cloning projects.
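To make the cM-per-Mbp calculation concrete, the sketch below estimates recombination rates in fixed physical windows from a list of markers that carry both a physical position (bp) and a genetic position (cM). It is a hypothetical illustration rather than the procedure used for Table 12: genetic positions at window boundaries are obtained by linear interpolation between flanking markers, which is our assumption, and the names `interpolate_cm` and `window_rates` are invented for the example.

```python
from bisect import bisect_right


def interpolate_cm(markers, pos):
    """Linearly interpolate the genetic position (cM) at physical position `pos`.

    `markers` must be sorted by physical position; each entry is (bp, cM).
    Positions outside the marker range are clamped to the end markers.
    """
    positions = [bp for bp, _ in markers]
    i = bisect_right(positions, pos)
    if i == 0:
        return markers[0][1]
    if i == len(markers):
        return markers[-1][1]
    (bp0, cm0), (bp1, cm1) = markers[i - 1], markers[i]
    return cm0 + (cm1 - cm0) * (pos - bp0) / (bp1 - bp0)


def window_rates(markers, window_bp=3_000_000):
    """Recombination rate (cM per Mbp) for consecutive windows spanned by markers."""
    markers = sorted(markers)
    first, last = markers[0][0], markers[-1][0]
    rates = []
    start = first
    while start + window_bp <= last:
        delta_cm = interpolate_cm(markers, start + window_bp) - interpolate_cm(markers, start)
        rates.append(delta_cm / (window_bp / 1e6))
        start += window_bp
    return rates
```

Taking the maximum, mean, and minimum of the values returned for each chromosome would give summaries of the kind reported in Table 12, separately for male, female, and sex-averaged genetic maps.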

4.3 Correlation between CpG islands 
and genes 

CpG islands are stretches of unmethylated DNA with a higher frequency
of CpG dinucleotides when compared with the entire genome (74). CpG
islands are believed to preferentially occur at the transcriptional
start of genes, and it has been observed that most housekeeping genes
have CpG islands at the 5' end of the transcript (75, 76). In
addition, experimental evidence indicates that CpG island methylation
is correlated with gene inactivation (77) and has been shown to be
important during gene imprinting (78) and tissue-specific gene
expression (79).

Experimental methods have been used that resulted in an estimate of
30,000 to 45,000 CpG islands in the human genome (74, 80) and an
estimate of 499 CpG islands on human chromosome 22 (81). Larsen et
al. (76) and Gardiner-Garden and Frommer (75) used a computational
method to identify CpG islands and defined them as regions of DNA of
>200 bp that have a G+C content of >50% and a ratio of observed versus
expected frequency of CG dinucleotide ≥0.6.

It is difficult to make a direct comparison of experimental
definitions of CpG islands with computational definitions because
computational methods do not consider the methylation state of
cytosine and experimental methods do not directly select regions of
high G+C content. However, we can determine the correlation of CpG
islands with gene starts, given a set of annotated genomic transcripts
and the whole genome sequence. We have analyzed the publicly available
annotation of chromosome 22, as well as using the entire human genome
in our assembly and the computationally annotated genes. A variation
of the CpG island computation was compared with Larsen et al. (76).
The main differences are that we use a sliding window of 200 bp,
consecutive windows are merged only if they overlap, and we recompute
the CpG value upon merging, thus rejecting any potential island if it
scores less than the threshold.
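The criteria above (a 200-bp sliding window, G+C content >50%, an observed/expected CG ratio at or above a threshold, merging of overlapping windows, and re-scoring after the merge) can be sketched as follows. This is a minimal Python illustration, not the program used for the analysis. The observed/expected ratio is computed with the commonly used form (number of CG x length) / (number of C x number of G), which we assume matches the cited definition; the one-base step size and all function names are likewise our assumptions.

```python
def cpg_stats(seq: str):
    """Return (G+C fraction, observed/expected CG ratio) for a sequence."""
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    cg = seq.count("CG")
    gc_content = (c + g) / n if n else 0.0
    obs_exp = (cg * n) / (c * g) if c and g else 0.0
    return gc_content, obs_exp


def is_island(seq: str, threshold: float) -> bool:
    """True if the sequence has G+C > 50% and a CG ratio >= threshold."""
    gc_content, obs_exp = cpg_stats(seq)
    return gc_content > 0.5 and obs_exp >= threshold


def find_cpg_islands(seq: str, window: int = 200, threshold: float = 0.6, step: int = 1):
    """Slide a window, merge overlapping hits, and re-score each merged region."""
    hits = [
        i for i in range(0, len(seq) - window + 1, step)
        if is_island(seq[i:i + window], threshold)
    ]
    merged = []
    for i in hits:
        if merged and i < merged[-1][1]:  # window overlaps the previous region
            merged[-1][1] = i + window
        else:
            merged.append([i, i + window])
    # Recompute the statistics on each merged region; drop it if it now fails.
    return [(s, e) for s, e in merged if is_island(seq[s:e], threshold)]
```

Calling `find_cpg_islands(seq, threshold=0.6)` corresponds to method 1 in the text and `threshold=0.8` to method 2.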

To compute various CpG statistics, we used two different thresholds of
CG dinucleotide likelihood ratio. Besides using the original threshold
of 0.6 (method 1), we used a higher threshold of CG dinucleotide
likelihood ratio of 0.8 (method 2), which results in a number of CpG
islands on chromosome 22 close to the number of annotated genes on
this chromosome. The main results are summarized in Table 13. CpG
islands computed with method 1 predicted only 2.6% of the CSA sequence
as CpG, but 40% of the gene starts (start codons) are contained
inside a
Fig. 10. Relation between G+C content and gene density. The blue bars show the percent of the
genome (in 50-kbp windows) with the indicated G+C content. The percent of the total number of
genes associated with each G+C bin is represented by the yellow bars. The graph shows that about
5% of the genome has a G+C content of between 50 and 55%, but that this portion contains
nearly 15% of the genes.



CpG island. This is comparable to ratios reported by others (82). The
last two rows of the table show the observed and expected average
distance, respectively, of the closest CpG island from the first exon.
The observed average distances to the closest CpG islands are smaller
than the corresponding expected distances, confirming an association
between CpG island and the first exon.

We also looked at the distribution of CpG island nucleotides among
various sequence classes such as intergenic regions, introns, exons,
and first exons. We computed the likelihood score for each sequence
class as the ratio of the observed fraction of CpG island nucleotides
in that sequence class and the expected fraction of CpG island
nucleotides in that sequence class. The results of applying method 1
to the CSA were scores of 0.89 for intergenic region, 1.2 for intron,
5.86 for exon, and 13.2 for first exon. The same trend was also found
for chromosome 22 and after the application of a higher threshold
(method 2) on both data sets. In sum, genome-wide analysis has
extended earlier analysis and suggests a strong correlation between
CpG islands and first coding exons.
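The likelihood score defined above is a ratio of observed to expected fractions. The sketch below shows one way such a score could be computed, under assumptions that are ours rather than stated in the text: the expected fraction for a class is taken to be that class's share of all nucleotides (that is, island nucleotides spread uniformly), and the classes passed in are treated as non-overlapping. The function name `enrichment_scores` and the numbers in the usage note are hypothetical.

```python
def enrichment_scores(class_bp, island_bp_in_class):
    """Observed/expected ratio of CpG island nucleotides per sequence class.

    class_bp:            {class name: total nucleotides in that class}
    island_bp_in_class:  {class name: CpG island nucleotides in that class}
    """
    total_bp = sum(class_bp.values())
    total_island_bp = sum(island_bp_in_class.values())
    scores = {}
    for name, bp in class_bp.items():
        observed = island_bp_in_class.get(name, 0) / total_island_bp
        expected = bp / total_bp  # assumes a uniform spread of island bases
        scores[name] = observed / expected
    return scores
```

With made-up inputs such as `{"intergenic": 700, "intron": 250, "exon": 50}` total bases and `{"intergenic": 35, "intron": 15, "exon": 50}` island bases, the scores are 0.5, 0.6, and 10, showing how a small class that holds many island nucleotides receives a score well above 1.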

4.4 Genome-wide repetitive elements 
The proportion of the genome covered by various classes of repetitive
DNA is presented in Table 14. We observed about 35% of the genome in
these repeat classes, very similar to values reported previously (83).
Repetitive sequence may be underrepresented in the Celera assembly as
a result of incomplete repeat resolution, as discussed above. About 8%
of the scaffold length is in gaps, and we expect that much of this is
repetitive sequence. Chromosome 19 has the highest repeat density
(57%), as well as the highest gene density (Table 10). Of interest,
among the different classes of repeat elements, we observe a clear
association of Alu elements and gene density, which was not observed
between LINEs and gene density.

5 Genome Evolution
Summary. The dynamic nature of genome evolution can be captured at
several levels. These include gene duplications mediated by RNA
intermediates (retrotransposition) and segmental genomic duplications.
In this section, we document the genome-wide occurrence of
retrotransposition events generating functional (intronless paralogs)
or inactive genes (pseudogenes). Genes involved in translational
processes and nuclear regulation account for nearly 50% of all
intronless paralogs and processed pseudogenes detected in our survey.
We have also cataloged the extent of segmental genomic duplication and
provide evidence for 1077 duplicated blocks covering 3522 distinct
genes.
Fig. 11. Genome structural features. [...] (green), EST density (blue), and Alu density (pink) along the lengths of
each of the chromosomes. Gene density was calculated in 1-Mbp windows. [...]



5.1 Retrotransposition in the human genome

Retrotransposition of processed mRNA transcripts into the genome
results in functional genes, called intronless paralogs, or
inactivated genes (pseudogenes). A paralog refers to a gene that
appears in more than one copy in a given organism as a result of a
duplication event. The existence of both intron-containing and
intronless forms of genes encoding functionally similar or identical
proteins has been previously described (84, 85). Cataloging these
evolutionary events on the genomic landscape is of value in
understanding the functional consequences of such gene-duplication
events in cellular biology. Identification of conserved intronless
paralogs in the mouse or other mammalian genomes should provide the
basis for capturing the evolutionary chronology of these transposition
events and provide insights into gene loss and accretion in the
mammalian radiation. A set of proteins corresponding to all 901
Otto-predicted, single-exon genes were subjected to BLAST analysis
against the proteins encoded by the remaining multiexon predicted
transcripts. Using homology criteria of 70% sequence identity over 90%
of the length, we identified 298 instances of single- to multi-exon
correspondence. Of these 298 sequences, 97 were represented in the
GenBank data set of experimentally validated full-length genes at the
stringency specified and were verified by manual inspection.

We believe that these 97 cases may represent intronless paralogs (see
Web table 1 on Science Online at
www.sciencemag.org/cgi/content/full/291/5507/1304/DC1) of known genes.
Most of these are flanked by direct repeat sequences, although the
precise nature of these repeats remains to be determined. All of the
cases for which we have high confidence contain polyadenylated
[poly(A)] tails characteristic of retrotransposition.
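The homology criterion used above (70% sequence identity over 90% of the length) amounts to a two-part filter on alignment results. The Python sketch below shows the idea on a generic hit record; it is not tied to any particular BLAST output format, the `Hit` fields are hypothetical, and we assume that "length" refers to the length of the single-exon query protein, which the text does not specify.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """A simplified alignment record (field names are illustrative)."""
    query: str          # single-exon protein
    subject: str        # multiexon protein
    identity: float     # fraction of identical residues in the alignment
    align_len: int      # aligned length on the query, in residues
    query_len: int      # full length of the query protein


def passes(hit: Hit, min_identity: float = 0.70, min_coverage: float = 0.90) -> bool:
    """True if the hit meets both the identity and the length-coverage cutoffs."""
    coverage = hit.align_len / hit.query_len
    return hit.identity >= min_identity and coverage >= min_coverage


def candidate_paralogs(hits):
    """Map each single-exon query to the multiexon subjects that pass the filter."""
    matches = {}
    for hit in hits:
        if passes(hit):
            matches.setdefault(hit.query, []).append(hit.subject)
    return matches
```

The same filter with `min_coverage=0.70` mirrors the looser 70%/70% cutoff applied to transcript-versus-genome matches in the pseudogene survey (section 5.2).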

Recent publications describing the phenomenon of functional intronless
paralogs speculate that retrotransposition may serve as a mechanism
used to escape X-chromosomal inactivation (84, 86). We do not find a
bias toward X chromosome origination of these retrotransposed genes;
rather, the results show a random chromosome distribution of both the
intron-containing and corresponding intronless paralogs. We also have
found several cases of retrotransposition from a single source
chromosome to multiple target chromosomes. Interesting examples
include the retrotransposition of a five-exon-containing ribosomal
protein L21 gene on chromosome 13 onto chromosomes 1, 3, 4, 7, 10, and
14, respectively. The size of the source genes can also show
variability. The largest example is the 31-exon diacylglycerol kinase
zeta gene on chromosome 11 that has an intronless paralog on
chromosome 13. Regardless of route, retrotransposition with subsequent
gene changes in coding or noncoding regions that lead to different
functions or expression patterns represents a key route to providing
an enhanced functional repertoire in mammals (87).

Our preliminary set of retrotransposed intronless paralogs contains a
clear overrepresentation of genes involved in translational processes
(40% ribosomal proteins and 10% translation elongation factors) and
nuclear regulation (HMG nonhistone proteins, 4%), as well as metabolic
and regulatory enzymes. EST matches specific to a subset of intronless
paralogs suggest expression of these intronless paralogs. Differences
in the upstream regulatory sequences between the source genes and
their intronless paralogs could account for differences in
tissue-specific gene expression. Defining which, if any, of these
processed genes are functionally expressed and translated will require
further elucidation and experimental validation.



5.2 Pseudogenes 

A pseudogene is a nonfunctional copy that is very similar to a normal
gene but that has been altered slightly so that it is not expressed.
We developed a method for the preliminary analysis of processed
pseudogenes in the human genome as a starting point in elucidating the
ongoing evolutionary forces


Table 11. Genome overview.

Size of the genome (including gaps)                                   2.91 Gbp
Size of the genome (excluding gaps)                                   2.66 Gbp
Longest contig                                                        1.99 Mbp
Longest scaffold                                                      14.4 Mbp
Percent of A+T in the genome                                          54
Percent of G+C in the genome                                          38
Percent of undetermined bases in the genome                           9
Most GC-rich 50 kb                                                    Chr. 2 (66%)
Least GC-rich 50 kb                                                   Chr. X (25%)
Percent of genome classified as repeats                               35
Number of annotated genes                                             26,383
Percent of annotated genes with unknown function                      42
Number of genes (hypothetical and annotated)                          39,114
Percent of hypothetical and annotated genes with unknown function     59
Gene with the most exons                                              Titin (234 exons)
Average gene size                                                     27 kbp
Most gene-rich chromosome                                             Chr. 19 (23 genes/Mb)
Least gene-rich chromosomes                                           Chr. 13 (5 genes/Mb), Chr. Y (5 genes/Mb)
Total size of gene deserts (>500 kb with no annotated genes)          605 Mbp
Percent of base pairs spanned by genes                                25.5 to 37.8*
Percent of base pairs spanned by exons                                1.1 to 1.4*
Percent of base pairs spanned by introns                              24.4 to 36.4*
Percent of base pairs in intergenic DNA                               74.5 to 63.6*
Chromosome with highest proportion of DNA in annotated exons          Chr. 19 (9.33)
Chromosome with lowest proportion of DNA in annotated exons           Chr. Y (0.36)
Longest intergenic region (between annotated + hypothetical genes)    Chr. 13 (3,038,416 bp)
Rate of SNP variation                                                 1/1250 bp

*In these ranges, the percentages correspond to the annotated gene set (26,383 genes) and the hypothetical +
annotated gene set (39,114 genes), respectively.

Table 12. Rate of recombination per physical distance (cM/Mb) across the genome. Genethon markers
were placed on CSA-mapped assemblies, and then relative physical distances and rates were calculated
in 3-Mb windows for each chromosome. NA, not applicable.

                  Male                 Sex-average               Female
Chrom.     Max    Avg.   Min.      Max    Avg.   Min.      Max    Avg.   Min.
1          2.60   1.12   0.23      2.81   1.42   0.52      3.39   1.76   0.68
2          2.23   0.78   0.33      2.65   1.12   0.54      3.17   1.40   0.61
3          2.55   0.86   0.23      2.40   1.07   0.42      2.71   1.30   0.33
4          1.66   0.67   0.15      2.06   1.04   ?         2.50   1.40   0.77
5          2.00   0.67   0.18      1.87   1.08   0.42      2.26   1.43   0.62
6          ?      ?.71   0.28      2.37   1.12   0.37      3.47   1.67   0.64
7          ?      1.16   0.48      1.67   1.17   0.47      2.27   1.21   0.34
8          1.83   0.73   0.14      2.40   1.05   0.46      3.44   1.36   0.43
9          2.01   0.99   0.53      1.?5   1.32   0.77      2.63   1.66   0.82
10         3.73   1.03   0.22      3.05   1.29   0.66      2.84   1.51   0.76
11         1.43   0.72   0.31      ?      0.?9   0.47      3.10   1.32   0.49
12         4.12   0.76   0.26      3.35   1.16   0.49      2.93   1.35   0.59
13         1.60   0.75   0.01      1.87   0.35   0.17      2.49   1.19   0.32
14         3.1?   0.98   0.18      2.65   1.30   0.62      3.14   1.63   0.75
15         2.28   0.?4   0.?4      2.31   1.22   0.42      2.33   1.56   0.54
16         1.83   1.00   0.47      2.70   1.55   0.63      4.99   2.32   1.12
17         3.87   0.87   0.00      3.54   1.35   0.54      4.19   1.83   0.94
18         3.12   1.37   0.86      3.75   1.66   0.43      4.35   2.24   0.72
19         3.02   0.?7   0.10      2.57   1.41   0.49      2.89   1.75   0.87
20         3.64   0.89   0.00      2.79   1.30   0.83      3.31   2.15   1.34
21         3.23   1.26   0.69      2.37   1.62   1.08      2.58   1.90   1.18
22         1.25   1.10   0.84      1.88   1.41   1.08      3.73   2.08   0.93
X          NA     NA     NA        NA     NA     NA        3.12   1.64   0.72
Y          NA     NA     NA        NA     NA     NA        NA     NA     NA
Genome     4.12   0.88   0.00      3.75   1.22   0.17      4.99   1.35   0.32
that account for gene inactivation. The general structural
characteristics of these processed pseudogenes include the complete
lack of intervening sequences found in the functional counterparts, a
poly(A) tract at the 3' end, and direct repeats flanking the
pseudogene sequence. Processed pseudogenes occur as a result of
retrotransposition, whereas unprocessed pseudogenes arise from
segmental genome duplication.

We searched the complete set of Otto-predicted transcripts against the
genomic sequence by means of BLAST. Genomic regions corresponding to
all Otto-predicted transcripts were excluded from this analysis. We
identified 2909 regions matching with greater than 70% identity over
at least 70% of the length of the transcripts that likely represent
processed pseudogenes. This number is probably an underestimate
because specific methods to search for pseudogenes were not used.

We looked for correlations between structural elements and the
propensity for retrotransposition in the human genome. GC content and
transcript length were compared between the genes with processed
pseudogenes (1177 source genes) versus the remainder of the predicted
gene set. Transcripts that give rise to processed pseudogenes have
shorter average transcript length (1027 bp versus 1594 bp for the Otto
set) as compared with genes for which no pseudogene was detected. The
overall GC content did not show any significant difference, contrary
to a recent report (88). There is a clear trend in gene families that
are present as processed pseudogenes. These include ribosomal proteins
(67%), lamin receptors (10%), translation elongation factor alpha
(5%), and HMG-non-histone proteins (2%). The increased occurrence of
retrotransposition (both intronless paralogs and processed
pseudogenes) among genes involved in translation and nuclear
regulation may reflect an increased transcriptional activity of these
genes.

5.3 Gene duplication in the human genome

Building on a previously published procedure (27), we developed a
graph-theoretic algorithm, called Lek, for grouping the predicted
human protein set into protein families (89).



Table 13. Characteristics of CpG islands identified in chromosome 22 (34-Mbp sequence length) and the
whole genome (2.9-Gbp sequence length) by means of two different methods. Method 1 uses a CG
likelihood ratio of ≥0.6. Method 2 uses a CG likelihood ratio of ≥0.8.

                                                          Chromosome 22          Whole genome (CS assembly)
                                                       Method 1    Method 2       Method 1    Method 2
Number of CpG islands detected                           5,211        522          195,706      26,876
Average length of island (bp)                              390        535              395         497
Percent of sequence predicted as CpG                       5.3        0.8              2.6         0.4
Percent of first exons that overlap a CpG island            44         25               42          22
Percent of first exons with first position of exon
  contained inside a CpG island                             37         22               40          21
Average distance between first exon and closest
  CpG island (bp)                                            ?     10,486            2,182      17,021
Expected distance between first exon and closest
  CpG island (bp)                                        3,262     32,567            7,164      55,811









Table 14. Distribution of repetitive DNA in the compartmentalized shotgun assembly sequence.

                                              Megabases in       Percent of     Previously
Repetitive elements                           assembled          assembly       predicted
                                              sequences                         (%) (83)
Alu                                              288                9.9            10.0
Mammalian interspersed repeat (MIR)               66                2.3             1.7
Medium reiteration (MER)                          50                1.7             1.6
Long terminal repeat (LTR)                       155                5.3             5.6
Long interspersed nucleotide element (LINE)      466               16.1            16.7
Total                                           1025               35.3            35.6



■The, complete clusters that result from the 
Lek clustering provide one basis for compar- 
ing the role of whole-genome or chromosom- 
al duplication in protein family expansion as 
opposed to other means, such as tandem du- 
plication. Because each complete cluster rep- 

• resents a closed and certain island of homol- 
ogy, and because Lek is capable of simulta- 
neously clustering protein complements of 

• several organisms, the number of proteins 
contributed by each organism to a complete 
cluster can be predicted with confidence de- 
pending on the quality of the annotation of 
each genome. The variance of each organ- 
ism's contribution to each cluster can then be 
calculated, allowing an assessment of the rel- 

r ative . importance of large-scale duplication 
versus smaller-scale, organism-specific ex- 
pansion and contraction of protein families, 
presumably as a result of natural selection 
operating on individual protein families with- 
in an organism. As can be seen in Fig. 12, the 
large variance in the relative numbers of hu- 
man as compared with D. melanogaster and 
Caenorhabditis elegans proteins in complete
clusters may be explained by multiple events 
of relative expansions in gene families in 
each of the three animal genomes. Such ex- 
pansions would give rise to the distribution 
that shows a peak at 1:1 in the ratio for 
human-worm or human-fly clusters with the 
slope spread covering both human and fly/
worm predominance, as we observed (Fig. 
12). Furthermore, there are nearly as many 
clusters where worm and fly proteins pre-
dominate despite the larger numbers of pro-
teins in the human. At face value, this anal- 
ysis suggests that natural selection acting on 
individual protein families has been a major 
force driving the expansion of at least some 
elements of the human protein set. However,
in our analysis, the difference between an
ancient whole-genome duplication followed
by loss, versus piecemeal duplication, cannot
be easily distinguished. In order to differen-
tiate these scenarios, more extended analyses
were performed.

5.4 Large-scale duplications 
Using two independent methods, we
searched for large-scale duplications in the
human genome. First, we describe a protein
family-based method that identified highly
conserved blocks of duplication. We then
describe our comprehensive method for identi-
fying all interchromosomal block duplications.
The latter method identified a large number of
duplicated chromosomal segments covering 
parts of all 24 chromosomes. 

The first of the methods is based on the 
idea of searching for blocks of highly con- 
served homologous proteins that occur in 
more than one location on the genome. For 
this comparison, two genes were considered 
equivalent if their protein products were de-
termined to be in the same family and the
same complete Lek cluster (essentially
paralogous genes) (89). Initially, each chro-
mosome was represented as a string of genes
ordered by the start codons for predicted
genes along the chromosome. We considered
the two strands as a single string, because
local inversions are relatively common events
relative to large-scale duplications. Each
gene was indexed according to the protein
family and Lek complete cluster (89). All
pairs of indexed gene strings were then
aligned in both the forward and reverse di-
rections with the Smith-Waterman algorithm
(90). A match between two proteins of the
same Lek complete cluster was given a score 
of 10 and a mismatch -10, with gap open 
and extend penalties of -4 and -1. With 
these parameters, 19 conserved interchromo- 
somal blocks of duplication were observed, 
all of which were also detected and expanded 
by the comprehensive method described be- 
low. The detection of only a relatively small 
number of block duplications was a conse- 
quence of using an intrinsically conservative 
method grounded in the conservative con- 
straints of the complete Lek clusters. 
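
The family-based comparison described above can be sketched as a local alignment over strings of cluster identifiers instead of residues. This is a minimal illustration, not the authors' implementation: the function names and the toy chromosomes are hypothetical, and the gap model (the open penalty is charged for the first gapped position, the extend penalty for each further one) is an assumed convention that the text does not specify. The scoring values are the ones given in the text.

```python
def local_align_score(a, b, match=10, mismatch=-10, gap_open=-4, gap_extend=-1):
    """Best Smith-Waterman local alignment score between two sequences of
    cluster identifiers, with affine gap penalties (assumed convention)."""
    neg_inf = float("-inf")
    rows, cols = len(a) + 1, len(b) + 1
    best_any = [[0] * cols for _ in range(rows)]        # best score ending at (i, j)
    gap_in_a = [[neg_inf] * cols for _ in range(rows)]  # b[j-1] is aligned to a gap
    gap_in_b = [[neg_inf] * cols for _ in range(rows)]  # a[i-1] is aligned to a gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            gap_in_a[i][j] = max(best_any[i][j - 1] + gap_open,
                                 gap_in_a[i][j - 1] + gap_extend)
            gap_in_b[i][j] = max(best_any[i - 1][j] + gap_open,
                                 gap_in_b[i - 1][j] + gap_extend)
            pair = match if a[i - 1] == b[j - 1] else mismatch
            best_any[i][j] = max(0, best_any[i - 1][j - 1] + pair,
                                 gap_in_a[i][j], gap_in_b[i][j])
            best = max(best, best_any[i][j])
    return best


def best_block_score(chrom_a, chrom_b, **scoring):
    """Score both orientations, since the strands are treated as one string."""
    forward = local_align_score(chrom_a, chrom_b, **scoring)
    reverse = local_align_score(chrom_a, chrom_b[::-1], **scoring)
    return max(forward, reverse)


# Hypothetical toy chromosomes, each a list of cluster identifiers.
chrom_x = ["f1", "f2", "f3"]
chrom_y = ["f1", "zz", "f2", "f3"]   # one unrelated gene inserted
print(best_block_score(chrom_x, chrom_y))  # 26 = 10 - 4 + 10 + 10
```

Scoring on cluster identity rather than residue similarity is what makes the method conservative: a pair of genes either shares a complete cluster or contributes a mismatch.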

In the second, more comprehensive ap-
proach, we aligned all chromosomes directly
with one another using an algorithm based on
the MUMmer system (91). This alignment
method uses a suffix tree data structure and
linear-time algorithm to align long sequences
very rapidly; for example, two chromosomes
of 100 Mbp can be aligned in less than 20
min (on a Compaq Alpha computer) with 4
gigabytes of memory. This procedure was 
used recently to identify numerous large- 
scale segmental duplications among the five 
chromosomes of A. thaliana (92); in that
organism, the method revealed that 60% of
the genome (66 Mbp) is covered by 24 very
large duplicated segments. For Arabidopsis, a
DNA-based alignment was sufficient to re-
veal the segmental duplications between
chromosomes; in the human genome, DNA
alignments at the whole-chromosome level 
are insufficiently sensitive. Therefore, a mod-
ified procedure was developed and applied,
as follows. First, 26,588 proteins
(9,675,713 amino acids) were concat-
enated end-to-end in order as they occur
along each of the 24 chromosomes, irrespec-
tive of strand location. The concatenated pro-
tein set was then aligned against each chro-
mosome by the MUMmer algorithm. The
resulting matches were clustered to extract
sets of three or more protein matches that
occur in close proximity on two different
chromosomes (93); these represent the can-
didate segmental duplications. A series of
filters were developed and applied to remove 
likely false-positives from this set; for exam- 
ple, small blocks that were spread across 
many proteins were removed. To refine the
filtering methods, a shuffled protein set was
first created by taking the 26,588 proteins, 
randomizing their order, and then partitioning 
them into 24 shuffled chromosomes, each 
containing the same number of proteins as the 
true genome. This shuffled protein set has the 
identical composition to the real genome; in 
particular, every protein and every domain 
appears the same number of times. The com-
plete algorithm was then applied to both the
real and the shuffled data, with the results on
the shuffled data being used to estimate the
false-positive rate. The algorithm after filter-
ing yielded 10,310 gene pairs in 1077 dupli-
cated blocks containing 3522 distinct genes; 
tandemly duplicated expansions in many of 
the blocks explain the excess of gene pairs to 
distinct genes. In the shuffled data, by con- 
trast, only 370 gene pairs were found, giving 
a false-positive estimate of 3.6%. The most 
likely explanation for the 1077 block dupli-
cations is ancient segmental duplications. In
many cases, the order of the proteins has been 
shuffled, although proximity is preserved. 
Out of the 1077 blocks, 159 contain only 
three genes, 137 contain four genes, and 781 
contain five or more genes. 
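
The shuffled control described above can be sketched as follows. This is a minimal sketch under stated assumptions: the function names and the representation of a chromosome as a list of protein identifiers are hypothetical, and the duplication-detection step itself is not reproduced. Only the pair counts quoted in the text (370 and 10,310) are taken from the source.

```python
import random


def shuffle_into_chromosomes(chromosomes, seed=0):
    """Randomize protein order genome-wide, then re-partition the pool into
    shuffled chromosomes holding the same number of proteins as the originals."""
    pool = [protein for chrom in chromosomes for protein in chrom]
    random.Random(seed).shuffle(pool)
    shuffled, start = [], 0
    for chrom in chromosomes:
        shuffled.append(pool[start:start + len(chrom)])
        start += len(chrom)
    return shuffled


def false_positive_rate(pairs_in_shuffled, pairs_in_real):
    """Fraction of reported gene pairs expected to arise by chance."""
    return pairs_in_shuffled / pairs_in_real


# Counts quoted in the text: 370 pairs in shuffled data, 10,310 in real data.
print(f"{false_positive_rate(370, 10310):.1%}")  # 3.6%
```

Because the shuffle only permutes existing proteins, every protein and every domain appears the same number of times as in the real genome, so any blocks found in the shuffled set reflect chance proximity and not composition.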

To illustrate the extent of the detected
duplications, Fig. 13 shows all 1077 block
duplications indexed to each chromosome in 
24 panels in which only duplications mapped 
to the indexed chromosome are displayed.
The figure makes it clear that the duplications
are ubiquitous in the genome. One feature
that it displays is many relatively small chro-
mosomal stretches, with one-to-many dupli-
cation relationships that are graphically strik-
ing. One such example captured by the anal-
ysis is the well-documented olfactory recep-
tor (OR) family, which is scattered in blocks 
throughout the genome and which has been 
analyzed for genome-deployment reconstruc-
tions at several evolutionary stages (94). The
figure also illustrates that some chromo- 
somes, such as chromosome 2, contain many 
more detected large-scale duplications than 
others. Indeed, one of the largest duplicated 
segments is a large block of 33 proteins on 
chromosome 2, spread among eight smaller 
blocks in 2p, that aligns to a paralogous set on 
chromosome 14, with one rearrangement (see 
chromosomes 2 and 14 panels in Fig. 13). 

The proteins are not contiguous but span a
region containing 97 proteins on chromo-
some 2 and 332 proteins on chromosome 14.
The likelihood of observing this many dupli- 
cated proteins by chance, even over a span of 
this length, is 2.3 X lO"^* (Pi). This dupli- 
cated set spans 20 Mbp on chromosome 2 and 
63 Mbp on chromosome 14, over 70% of the 
latter chromosome. Chromosome 2 also con- 
tains a block duplication that is nearly as 
large, which is shared by chromosome arm 2q 
and chromosome 12. This duplication incor- 
porates two of the four known Hox gene 
clusters, but considerably expands the extent 
of the duplications proximally and distally on 
the pair of chromosome arms. This breadth of 
duplication is also seen on the two chromo- 
somes carrying the other two Hox clusters. 

An additional large duplication, between 
chromosomes 18 and 20, serves as a good 
example to illustrate some of the features 
common to many of the other observed large 
duplications (Fig. 13, inset). This duplication
contains 64 detected ordered intrachromo-
somal pairs of homologous genes. After dis-
counting a 40-Mb stretch of chromosome 18
free of matches to chromosome 20, which is
likely to represent a large insert (between the 
gene assignments "Krup rel" and "collagen 
rel" on chromosome 18 in Fig. 13), the full 
duplication segment covers 36 Mb on chro- 
mosome 18 and 28 Mb on chromosome 20.
By this measure, the duplication segment
spans nearly half of each chromosome's net
length. The most likely scenario is that the 
whole span of this region was duplicated as a 
single very large block, followed by shuffling 
owing to smaller scale rearrangements. As 
such, at least four subsequent rearrangements 
would need to be invoked to explain the 
relative insertions and inversions seen in the 
duplicated segment interval. The 64 protein 
pairs in this alignment occur among 217 pro- 
tein assignments on chromosome 18, and 
among 322 protein assignments on chromo-
some 20, for a density of involved proteins of
20 to 30%. This is consistent with an ancient
large-scale duplication followed by subse- 
quent gene loss on one or both chromosomes. 
Loss of just one member of a gene pair 
subsequent to the duplication would result in 
a failure to score a gene pair in the block; less 
than 50% gene loss on the chromosomes 
would lead to the duplication density ob-
served here. As an independent verification
of the significance of the alignments detect-
ed, it can be seen that a substantial number of
the pairs of aligning proteins in this duplica-
tion, including some of those annotated (Fig.
13), are those populating small Lek complete
clusters (see above). This indicates that they
are members of very small families of para-
logs; their relative scarcity within the genome
validates the uniqueness and robust nature of
their alignments.

Two additional qualitative features were ob-
served among many of the large-scale duplica-
tions. First, several proteins with disease asso- 
ciations, with OMIM (Online Mendelian Inher- 
itance in Man) assignments, are members of 
duplicated segments (see web table 2 on Sci-
ence Online at www.sciencemag.org/cgi/con-
tent/full/291/5507/1304/DC1). We have also
observed a few instances where paralogs on 
both duplicated segments are associated with 
similar disease conditions. Notable among 
these genes are proteins involved in hemostasis
(coagulation factors) that are associated with
bleeding disorders, transcriptional regulators
like the homeobox proteins associated with de-
velopmental disorders, and potassium channels
associated with cardiovascular conduction ab-
normalities. For each of these disease genes,
closer study of the paralogous genes in the
duplicated segment may reveal new insights
into disease causation, with further investiga-
tion needed to determine whether they might be
involved in the same or similar genetic diseases.
Second, although there is a conserved number
of proteins and coding exons predicted for spe- 
cific large duplicated spans within the chromo- 
some 18 to 20 alignment, the genomic DNA of 
chromosome 18 in these specific spans is in 
some cases more than 10-fold longer than the 
corresponding chromosome 20 DNA. This se-
lective accretion of noncoding DNA (or con-
versely, loss of noncoding DNA) on one of a
pair of duplicated chromosome regions was
observed in many compared regions. Hypothe- 
ses to explain which mechanisms foster these 
processes must be tested. 

Evaluation of the alignment results gives 
some perspective on dating of the duplications. 
As noted above, large-scale ancient segmental 
duplication in fact best explains many of the
blocks detected by this genome-wide analysis.
The regions of human chromosomes involved
in the large-scale duplications expanded upon
above (chromosomes 2 to 14, 2 to 12, and 18 to
20) are each syntenic to a distinct mouse chro-
mosomal region. The corresponding mouse
chromosomal regions are much more similar in
sequence conservation, and even in order, to
their human synteny partners than the human
duplication regions are to each other. Further,
the corresponding mouse chromosomal regions
each bear a significant proportion of genes or-
thologous to the human genes on which the
human duplication assignments were made. On
the basis of these factors, the corresponding 
mouse chromosomal spans, at coarse resolu- 
tion, appear to be products of the same large- 
scale duplications observed in humans. Al- 
though further detailed analysis must be carried 
out once a more complete genome is assembled 
for mouse, the underlying large duplications 
appear to predate the two species' divergence.
This dates the duplications, at the latest, before
divergence of the primate and rodent lineages.
This date can be further refined upon examina-
tion of the synteny between human chromo-
somes and those of chicken, pufferfish (Fugu
rubripes), or zebrafish (95). The only sub-
stantial syntenic stretches mapped in these
species corresponding to both pairs of human
duplications are restricted to the Hox cluster
regions. When the synteny of these regions
(or others) to human chromosomes is extend-
ed with further mapping, the ages of the
nearly chromosome-length duplications seen
in humans are likely to be dated to the root of
vertebrate divergence.

The MUMmer-based results demonstrate 
large block duplications that range in size from
a few genes to segments covering most of a
chromosome. The extent of segmental duplica-
tions raises the question of whether an ancient
whole-genome duplication event is the under-
lying explanation for the numerous duplicated
regions (96). The duplications have undergone
many deletions and subsequent rearrangements;
these events make it difficult to distinguish
between a whole-genome duplication and mul-
tiple smaller events. Further analysis, focused
especially on comparing the estimated ages of 
all the block duplications, derived partially 
from interspecies genome comparisons, will be 
necessary to determine which of these two hy- 
potheses is more likely. Comparisons of ge- 
nomes of different vertebrates, and even cross- 
phyla genome comparisons, will allow for the 
deconvolution of duplications to eventually re-
veal the stagewise history of our genome, and
with it a history of the emergence of many of
the key functions that distinguish us from other
living things.

6 A Genome-Wide Examination of 
Sequence Variations 
Summary. Computational methods were used
to identify single-nucleotide polymorphisms
(SNPs) by comparison of the Celera sequence
to other SNP resources. The SNP rate be-
tween two chromosomes was ~1 per 1200 to
1500 bp. SNPs are distributed nonrandomly
throughout the genome. Only a very small
proportion of all SNPs (<1%) potentially
impact protein function based on the func-
tional analysis of SNPs that affect the pre-
dicted coding regions. This results in an es-
timate that only thousands, not millions, of 
genetic variations may contribute to the struc- 
tural diversity of human proteins. 

Having a complete genome sequence enables
researchers to achieve a dramatic acceleration
in the rate of gene discovery, but only through
analysis of sequence variation in DNA can we
discover the genetic basis for variation in health
among human beings. Whole-genome shotgun
sequencing is a particularly effective method
for detecting sequence variation in tandem with
whole-genome assembly. In addition, we com-
pared the distribution and attributes of SNPs
ascertained by three other methods: (i) align-
ment of the Celera consensus sequence to the
PFP assembly, (ii) overlap of high-quality reads
of genomic sequence (referred to as "Kwok";
1,120,195 SNPs) (97), and (iii) reduced repre-
sentation shotgun sequencing (referred to as
"TSC"; 632,640 SNPs) (98). These data were
consistent in showing an overall nucleotide di-
versity of ~8 × 10^-4, marked heterogeneity
across the genome in SNP density, and an
overwhelming preponderance of noncoding
variation that produces no change in expressed
proteins.

6.1 SNPs found by aligning the Celera 
consensus to the PFP assembly 
Ideally, methods of SNP discovery make full
use of sequence depth and quality at every site,
and quantitatively control the rate of false-pos-
itive and false-negative calls with an explicit
sampling model (99). Comparison of consensus
sequences in the absence of these details neces-
sitated a more ad hoc approach (quality scores
could not readily be obtained for the PFP as-
sembly). First, all sequence differences between
the two consensus sequences were identified;
these were then filtered to reduce the contribu-
tion of sequencing errors and [illegible]. As
a measure of the effectiveness of the filtering
step, we monitored the ratio of transition to
transversion substitutions, because a 2:1 ratio
has been well documented as typical in mam-
malian evolution (100) and in human SNPs
(101, 102). The filtering steps consisted of re-
moving variants where the quality score in the
Celera consensus was less than 30 and where
the density of variants was greater than 5 in 400
bp. These filters resulted in shifting the transi-
tion-to-transversion ratio from 1.57:1 to
1.89:1. When applied to 2.3 Gbp of alignments
between the Celera and PFP consensus se-
quences, these filters resulted in identification
of 2,104,820 putative SNPs from a total of
[illegible],778,474 substitution differences. Overlaps
between this set of SNPs and those found by
other methods are described below.
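
The filtering and monitoring steps above can be sketched in a few lines. This is a minimal sketch, assuming each candidate is available as a (reference base, alternate base) pair with a consensus quality score and a count of variants in its surrounding 400-bp window; the function names and that record layout are hypothetical. The thresholds (quality 30, 5 variants per 400 bp) are the ones quoted in the text, read here as two independent removal criteria.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """True for purine<->purine or pyrimidine<->pyrimidine substitutions."""
    ref, alt = ref.upper(), alt.upper()
    if ref == alt:
        raise ValueError("not a substitution")
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES


def ts_tv_ratio(substitutions):
    """Transition-to-transversion ratio for a list of (ref, alt) pairs."""
    transitions = sum(1 for ref, alt in substitutions if is_transition(ref, alt))
    transversions = len(substitutions) - transitions
    return transitions / transversions if transversions else float("inf")


def passes_filters(quality, variants_in_window, min_quality=30, max_variants=5):
    """Keep a candidate unless its quality is below 30 or more than 5
    variants fall in its 400-bp window (thresholds from the text)."""
    return quality >= min_quality and variants_in_window <= max_variants


# Toy data: two transitions (A>G, C>T) and one transversion (A>C).
print(ts_tv_ratio([("A", "G"), ("C", "T"), ("A", "C")]))  # 2.0
```

Random base-calling errors favor transversions, because each base has two possible transversions but only one transition, so a ratio that rises toward 2:1 after filtering suggests that errors are being removed.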
6.2 Comparisons to public SNP
databases

Additional SNPs, including 2,536,021 from
dbSNP (www.ncbi.nlm.nih.gov/SNP) and
[illegible],150 from HGMD (Human Gene Muta-
tion Database, from the University of
Wales, UK), were mapped on the Celera con-
sensus sequence by a sequence similarity
search with the program PowerBlast (103). The
two largest data sets in dbSNP are the Kwok
and TSC sets, with 47% and 25% of the dbSNP
records. Low-quality alignments with partial
coverage of the dbSNP sequence and align-
ments that had less than 98% sequence identity
between the Celera sequence and the dbSNP
flanking sequence were eliminated. dbSNP se-
quences mapping to multiple locations on the
Celera genome were discarded. A total of
[illegible]36,935 dbSNP variants were mapped to
[illegible]23,038 unique locations on the Celera se-
quence, implying considerable redundancy in
dbSNP. SNPs in the TSC set mapped to
585,811 unique genomic locations, and SNPs in
the Kwok set mapped to 438,032 unique loca-
tions. The combined unique SNP count used
in this analysis, including Celera-PFP, TSC,
and Kwok, is 2,737,668. Table 15 shows that a
substantial fraction of SNPs identified by one of
these methods was also found by another meth-
od. The very high overlap (36.2%) between the
Kwok and Celera-PFP SNPs may be due in part
to the use by Kwok of sequences that went into
the PFP assembly. The unusually low overlap
([illegible]%) between the Kwok and TSC sets is due
to their being the smallest two sets. In addition,
24.5% of the Celera-PFP SNPs overlap with
SNPs derived from the Celera genome se-
quences (46). SNP validation in population
samples is an expensive and laborious process,
so confirmation on multiple data sets may pro-
vide an efficient initial validation "in silico" (by
computational analysis).

One means of assessing whether the
three sets of SNPs provide the same picture
of human variation is to tally the frequen-
cies of the six possible base changes in
each set of SNPs (Table 16). Previous mea-
sures of nucleotide diversity were mostly
derived from small-scale analysis on can-
didate genes (101), and our analysis with
all three data sets validates the previous
observations at the whole-genome scale.
There is remarkable homogeneity between
the SNPs found in the Kwok set, the TSC
set, and in our whole-genome shotgun (46)
in this substitution pattern. Compared with
the rest of the data sets, Celera-PFP devi-
ates slightly from the 2:1 transition-to-
transversion ratio observed in the other
SNP sets. This result is not unexpected,
because some fraction of the computation-
ally identified SNPs in the Celera-PFP
comparison may in fact be sequence errors.
A 2:1 transition:transversion ratio for the
bona fide SNPs would be obtained if one
assumed that 15% of the sequence differ-
ences in the Celera-PFP set were a result of
(presumably random) sequence errors.

Table 15. Overlap of SNPs from genome-wide
databases. Table entries are SNP counts for
each pair of data sets. Numbers in parentheses are
the fraction of overlap, calculated as the count of
overlapping SNPs divided by the number of SNPs
in the smaller of the two databases compared.
SNP counts for the databases are: Celera-
PFP, 2,104,820; TSC, 585,811; and Kwok, 438,032.
Only unique SNPs in the TSC and Kwok data sets
were included.

6.3 Estimation of nucleotide diversity
from ascertained SNPs

The number of SNPs identified varied
widely across chromosomes. In order to
normalize these values to the chromosome
size and sequence coverage, we used π, the
standard statistic for nucleotide diversity
(104). Nucleotide diversity is a measure of
per-site heterozygosity, quantifying the
probability that a pair of chromosomes
drawn from the population will differ at a
nucleotide site. In order to calculate nucle-
otide diversity for each chromosome, we
need to know the number of nucleotide
sites that were surveyed for variation, and
in methods like reduced representation se-
quencing, we need to know the sequence
quality and the depth of coverage at each
site. These data are not readily available, so
we could not estimate nucleotide diversity
from the TSC effort. Estimation of nucleo-
tide diversity from high-quality sequence
overlaps should be possible, but again,
more information is needed on the details
of all the alignments.

Estimation of nucleotide diversity from a
shotgun assembly entails calculating, for each
column of the multialignment, the probability
that two or more distinct alleles are present,
and the probability of detecting a SNP if in
fact the alleles have different sequence (i.e.,
the probability of correct sequence calls). The
greater the depth of coverage and the higher
the sequence quality, the higher is the chance
of successfully detecting a SNP (105). Even
after correcting for variation in coverage, the
nucleotide diversity appeared to vary across
autosomes. The significance of this heteroge-
neity was tested by analysis of variance, with
estimates of π for 100-kbp windows to esti-
mate variability within chromosomes (for the
Celera-PFP comparison, F = 29.73, P <
0.0001).

Average diversity for the autosomes es- 
timated from the Celera-PFP comparison 
was 8.94 × 10^-4. Nucleotide diversity on
the X chromosome was 6.54 × 10^-4. The
X is expected to be less variable than au-
tosomes, because for every four copies of
autosomes in the population, there are only
three X chromosomes, and this smaller ef-
fective population size means that random
drift will more rapidly remove variation
from the X (106).
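
The comparison of X and autosomal diversity can be checked with simple arithmetic. This is a minimal sketch, assuming that neutral diversity scales linearly with the number of chromosome copies in the population (three X chromosomes for every four autosomes), with equal mutation rates and an equal sex ratio; the function names are hypothetical. The two diversity values are those quoted in the text.

```python
def nucleotide_diversity(differences, sites_compared):
    """Per-site diversity for a pair of sequences: differences / sites surveyed."""
    if sites_compared <= 0:
        raise ValueError("sites_compared must be positive")
    return differences / sites_compared


def expected_x_diversity(autosomal_pi, x_to_autosome_copies=3 / 4):
    """Neutral expectation for X diversity, scaled by relative copy number
    (assumption: diversity is proportional to effective population size)."""
    return autosomal_pi * x_to_autosome_copies


autosomal_pi = 8.94e-4    # autosomal estimate quoted in the text
observed_x_pi = 6.54e-4   # X chromosome estimate quoted in the text

expected_x_pi = expected_x_diversity(autosomal_pi)
observed_ratio = observed_x_pi / autosomal_pi
print(f"expected X diversity under the 3/4 rule: {expected_x_pi:.3e}")
print(f"observed X/autosome ratio: {observed_ratio:.3f}")
```

Under these assumptions the expected X diversity is about 6.7 × 10^-4, and the observed ratio of about 0.73 lies close to, and slightly below, the simple three-quarters expectation.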

Having ascertained nucleotide variation
genome-wide, it appears that previous esti-
mates of nucleotide diversity in humans
based on samples of genes were reasonably
accurate (101, 102, 106, 107). Genome-wide,
our estimate of nucleotide diversity was
8.98 × 10^-4 for the Celera-PFP alignment,
and a published estimate averaged over 10
densely resequenced human genes was
8.00 × 10^-4 (108).

6.4 Variation in nucleotide diversity
across the human genome
Such an apparently high degree of variabil-
ity among chromosomes in SNP density
raises the question of whether there is het-
erogeneity at a finer scale within chromo-


Table 16. Summary of nucleotide changes in different SNP data sets.
*November 2000 release of the NCBI database dbSNP (www.ncbi.nlm.nih.gov/SNP/) with the method defined as
OverlapSnpDetectionWithPolyBayes. The submitter of the data is Pui-Yan Kwok from Washington University. †November
2000 release of NCBI dbSNP (www.ncbi.nlm.nih.gov/SNP/) with the methods defined as TSC-Sanger, TSC-WICGR, and
TSC-WUGSC. The submitter of the data is Lincoln Stein from Cold Spring Harbor Laboratory.

Fig. 13. Segmental duplications between chromosomes in the human genome. The 24 panels show
the 1077 duplicated blocks of genes, containing 10,310 pairs of genes in total. Each line represents
a pair of homologous genes belonging to a block; all blocks contain at least three genes on each of
the chromosomes where they appear. Each panel shows all the duplications between a single
chromosome and other chromosomes with shared blocks. The chromosome at the center of each
panel is shown as a thick red line for emphasis. Other chromosomes are displayed from top to
bottom in each panel ordered by chromosome number. The inset (bottom, center right) shows a
close-up of one duplication between chromosomes 18 and 20, expanded to display the gene
names of 12 of the 64 gene pairs shown.

somes and whether this heterogeneity is
greater than expected by chance. If SNPs
occur by random and independent mutations, 
then it would seem that there ought to be a 
Poisson distribution of numbers of SNPs in 
fragments of arbitrary constant size. The ob- 
served dispersion in the distribution of SNPs 
in 100-kbp fragments was far greater than 
predicted from a Poisson distribution (Fig. 
14). However, this simplistic model ignores 
the different recombination rates and popula- 
tion histories that exist in different regions of
the genome. Population genetics theory holds
that we can account for this variation with a
mathematical formulation called the neutral
coalescent (109). Applying well-tested algo-
rithms for simulating the neutral coalescent
with recombination (110), and using an ef-
fective population size of 10,000 and a per-
base recombination rate equal to the mutation
rate (111), we generated a distribution of num-
bers of SNPs by this model as well (112). The
observed distribution of SNPs has a much larg-
er variance than either the Poisson model or the
coalescent model, and the difference is highly
significant. This implies that there is significant
variability across the genome in SNP density,
an observation that begs an explanation.
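
The Poisson comparison above rests on a simple property: for a Poisson distribution the variance equals the mean, so the variance-to-mean ratio of window counts should be close to 1. The following minimal sketch computes that ratio (the index of dispersion) for per-window SNP counts. The function name and the toy counts are hypothetical, and the sketch covers only the Poisson baseline, not the coalescent simulation described in the text.

```python
from statistics import mean, pvariance


def index_of_dispersion(counts):
    """Variance-to-mean ratio of per-window counts.
    About 1 for Poisson-distributed counts; well above 1 signals overdispersion."""
    mu = mean(counts)
    if mu == 0:
        raise ValueError("mean count is zero; ratio undefined")
    return pvariance(counts, mu) / mu


# Hypothetical SNP counts in equal-sized windows.
even_windows = [2, 2, 2, 2]        # no spread at all
patchy_windows = [0, 10, 0, 10]    # SNPs concentrated in some windows

print(index_of_dispersion(even_windows))    # 0.0
print(index_of_dispersion(patchy_windows))  # 5.0
```

A formal test would compare the observed ratio against its sampling distribution under each model; the ratio alone only indicates the direction and rough size of the departure.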

Several attributes of the DNA sequence
may affect the local density of SNPs, in-
cluding the rate at which DNA polymerase
makes errors and the efficacy of mismatch
repair. One key factor that is likely to be
associated with SNP density is the G+C
content, in part because methylated cy-
tosines in CpG dinucleotides tend to under-
go deamination to form thymine, account-
ing for a nearly 10-fold increase in the
mutation rate of CpGs over other dinucle-
otides. We tallied the G+C content and nu-
cleotide diversities in 100-kbp windows 
across the entire genome and found that the 
correlation between them was positive (r = 
0.21) and highly significant (P < 0.0001), 
but G+C content accounted for only a 
small part of the variation. 

6.5 SNPs by genomic class 
To test homogeneity of SNP densities
across functional classes, we partitioned
sites into intergenic (defined as >5 kbp
from any predicted transcription unit), 5'-
UTR, exonic (missense and silent), in-
tronic, and 3'-UTR for 10,239 known
genes, derived from the NCBI RefSeq da-
tabase and all human genes predicted from
the Celera Otto annotation. In coding re-
gions, SNPs were categorized as either si-
lent, for those that do not change amino
acid sequence, or missense, for those that
change the protein product. The ratio of
missense to silent coding SNPs in Celera-
PFP, TSC, and Kwok sets (1.12, 0.91, and
0.78, respectively) shows a markedly re-
duced frequency of missense variants com-
pared with the neutral expectation, consis-
tent with the elimination by natural selec-
tion of a fraction of the deleterious amino
acid changes (112). These ratios are com-
parable to the missense-to-silent ratios of
0.88 and 1.17 found by Cargill et al. (101)
and by Halushka et al. (102). Similar re-
sults were observed in SNPs derived from
Celera shotgun sequences (46).
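
The silent versus missense categorization can be illustrated with the standard genetic code. This is a minimal sketch, not the annotation pipeline used in the study: the function names are hypothetical, the input is assumed to be an in-frame codon with a zero-based position and a replacement base, and changes to or from a stop codon are reported as a separate category, which is a choice made here and not described in the text.

```python
# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def classify_coding_snp(codon, position, alt_base):
    """Classify a single-base change inside a codon as silent or missense.
    Changes involving a stop codon get their own label (an assumption here)."""
    codon = codon.upper()
    mutated = codon[:position] + alt_base.upper() + codon[position + 1:]
    before, after = CODON_TABLE[codon], CODON_TABLE[mutated]
    if before == after:
        return "silent"
    if "*" in (before, after):
        return "stop_change"
    return "missense"


def missense_to_silent_ratio(labels):
    """Ratio of missense to silent calls in a list of classification labels."""
    silent = labels.count("silent")
    return labels.count("missense") / silent if silent else float("inf")


print(classify_coding_snp("GAA", 2, "G"))  # silent   (GAA and GAG both encode Glu)
print(classify_coding_snp("GAA", 1, "T"))  # missense (GAA Glu -> GTA Val)
```

Because most third-position changes are silent while nearly all first- and second-position changes alter the amino acid, random mutation alone would yield more missense than silent changes, which is why ratios near 1 are read as evidence of selection against missense variants.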

It is striking how small is the fraction of 
SNPs that lead to potentially dysfunctional 
alterations in proteins. In the 10,239 Ref-
Seq genes, missense SNPs were only about




Fig. 14. SNP density in each 100-kbp interval as determined with Celera-PFP SNPs. The color codes
are as follows: black, Celera-PFP SNP density; blue, coalescent model; and red, Poisson distribution.
The figure shows that the distribution of SNPs along the genome is nonrandom and is not entirely
accounted for by a coalescent model of regional history.



0.12, 0.14, and 0.17% of the total SNP 
counts in Celera-PFP, TSC, and Kwok 
SNPs, respectively. Nonconservative pro- 
tein changes constitute an even smaller frac-
tion of missense SNPs (47, 41, and 40% in
Celera-PFP, Kwok, and TSC). Intergenic re-
gions have been virtually unstudied (113), and
we note that 75% of the SNPs we identified
were intergenic (Table 17). The SNP rate was
highest in introns and lowest in exons. The SNP
rate was lower in intergenic regions than in
introns, providing one of the first discriminators
between these two classes of DNA. These SNP
rates were confirmed in the Celera SNPs, which
also exhibited a lower rate in exons than in
introns, and in extragenic regions than in in-
trons (46). Many of these intergenic SNPs will
provide valuable information in the form of
markers for linkage and association studies, and
some fraction is likely to have a regulatory
function as well.

7 An Overview of the Predicted
Protein-Coding Genes in the Human
Genome

Summary. This section provides an initial 
computational analysis of the predicted 
protein set with the aim of cataloging
prominent differences and similarities
when the human genome is compared with
other fully sequenced eukaryotic genomes.
Over 40% of the predicted protein set in
humans cannot be ascribed a molecular
function by methods that assign proteins to
known families. A protein domain-based
analysis provides a detailed catalog of the
prominent differences in the human ge-
nome when compared with the fly and
worm genomes. Prominent among these are
domain expansions in proteins involved in 
developmental regulation and in cellular 
processes such as neuronal function, hemo- 
stasis, acquired immune response, and cy- 
toskeletal complexity. The final enumera- 
tion of protein families and details of pro- 
tein structure will rely on additional exper- 
imental work and comprehensive manual 
curation. 

A preliminary analysis of the predicted human protein-coding genes was
conducted. Two methods were used to analyze and classify the molecular
functions of 26,588 predicted proteins that represent 26,383 gene
predictions with at least two lines of evidence as described above.
The first method was based on an analysis at the level of protein
families, with both the publicly available Pfam database (114, 115)
and Celera's Panther Classification (CPC) (Fig. 15) (116). The second
method was based on an analysis at the level of protein domains, with
both the Pfam and SMART databases (115, 117).

The results presented here are preliminary and are subject to several
limitations. Both the gene predictions and functional assignments have
been made by using computational tools, although the statistical
models in Panther, Pfam, and SMART have been built, annotated, and
reviewed by expert biologists. In the set of computationally predicted
genes, we expect both false-positive predictions (some of these may in
fact be inactive pseudogenes) and false-negative predictions (some
human genes will not be computationally predicted). We also expect
errors in delimiting the boundaries of exons and genes. Similarly, in
the automatic functional assignments, we also expect both
false-positive and false-negative predictions. The functional
assignment protocol focuses on protein families that tend to be found
across several organisms, or on families of known human genes.
Therefore, we do not assign a function to many genes that are not in
large families, even if the function is known. Unless otherwise
specified, all enumeration of the genes in any given family or
functional category was taken from the set of 26,588 predicted
proteins, which were assigned functions by using statistical score
cutoffs defined for models in Panther, Pfam, and SMART.

For this initial examination of the predicted human protein set, three
broad questions were asked: (i) What are the likely molecular
functions of the predicted gene products, and how are these proteins
categorized with current classification methods? (ii) What are the
core functions that appear to be common across the animals? (iii) How
does the human protein complement differ from that of other sequenced
eukaryotes?

7.1 Molecular functions of predicted 
human proteins 

Figure 15 shows an overview of the putative molecular functions of the
predicted 26,588 human proteins that have at least two lines of
supporting evidence. About 41% (12,809) of the gene products could not
be classified from this initial analysis and are termed "proteins with
unknown functions." Because our automatic classification methods treat
only relatively large protein families, there are a number of
"unclassified" sequences that do, in fact, have a known or predicted
function. For the 60% of the protein set that have automatic
functional predictions, the specific protein functions have been
placed into broad classes. We focus here on molecular function (rather
than higher order cellular processes) in order to classify as many
proteins as possible. These functional predictions are based on
similarity to sequences of known function.

In our analysis of the 12,731 additional low-confidence predicted
genes (those with only one piece of supporting evidence), only 636
(5%) of these additional putative genes were assigned molecular
functions by the automated methods. One-third of these 636 predicted
genes represented endogenous retroviral proteins, further suggesting
that the majority of these unknown-function genes are not real genes.
Given that most of these additional 12,095 genes appear to be unique
among the genomes sequenced to date, many may simply represent
false-positive gene predictions.

The most common molecular functions are the transcription factors and
those involved in nucleic acid metabolism (nucleic acid enzyme). Other
functions that are highly represented in the human genome are the
receptors, kinases, and hydrolases. Not surprisingly, most of the
hydrolases are proteases. There are also many proteins that are
members of proto-oncogene families, as well as families of "select
regulatory molecules": (i) proteins involved in specific steps of
signal transduction such as heterotrimeric GTP-binding proteins (G
proteins) and cell cycle regulators, and (ii) proteins that modulate
the activity of kinases, G proteins, and phosphatases.

Table 17. Distribution of SNPs in classes of genomic regions.

Fig. 15. Distribution of the molecular functions of 26,383 human
genes. Each slice lists the numbers and percentages (in parentheses)
of human gene functions assigned to a given category of molecular
function. The outer circle shows the assignment to molecular function
categories in the Gene Ontology (GO) (179), and the inner circle shows
the assignment to Celera's Panther molecular function categories
(116).

7.2 Evolutionary conservation of core processes

Because of the various "model organism" genome-sequencing projects
that have already been completed, reasonable comparative information
is available for beginning the analysis of the evolution of the human
genome. The genomes of S. cerevisiae ("bakers' yeast") (118) and two
diverse invertebrates, C. elegans (a nematode worm) (119) and D.
melanogaster (fly) (25), as well as the first plant genome, A.
thaliana, recently completed (92), provide a diverse background for
genome comparisons.

We enumerated the "strict orthologs" conserved between human and fly,
and between human and worm (Fig. 16) to address the question, What are
the core functions that appear to be common across the animals? The
concept of orthology is important because if two genes are orthologs,
they can be traced by descent to the common ancestor of the two
organisms (an "evolutionarily conserved protein set"), and therefore
are likely to perform similar, conserved functions in the different
organisms. It is critical in this analysis to separate orthologs (a
gene that appears in two organisms by descent from a common ancestor)
from paralogs (a gene that appears in more than one copy in a given
organism by a duplication event) because paralogs may subsequently
diverge in function. Following the yeast-worm ortholog comparison in
(120), we identified two different cases for each pairwise comparison
(human-fly and human-worm). The first case was a pair of genes, one
from each organism, for which there was no other close homolog in
either organism. These are straightforwardly identified as
orthologous, because there are no additional members of the families
that complicate separating orthologs from paralogs. The second case is
a family of genes with more than one member in either or both of the
organisms being compared. Chervitz et al. (120) dealt with this case
by analyzing a phylogenetic tree that described the relationships
between all of the sequences in both organisms, and then looking for
pairs of genes that were nearest neighbors in the tree. If the
nearest-neighbor pairs were from different organisms, those genes were
presumed to be orthologs. We note that these nearest neighbors can
often be confidently identified from pairwise sequence comparison
without having to examine a phylogenetic tree (see legend to Fig. 16).
If the nearest neighbors are not from different organisms, there has
been a paralogous expansion in one or both organisms after the
speciation event (and/or a gene loss by one organism). When this
one-to-one correspondence is lost, defining an ortholog becomes
ambiguous. For our initial computational overview of the predicted
human protein set, we could not answer this question for every
predicted protein. Therefore, we consider only "strict orthologs,"
i.e., the proteins with unambiguous one-to-one relationships (Fig.
16). By these criteria, there are 2758 strict human-fly orthologs and
2031 strict human-worm orthologs (1523 in common between these sets).
We define the evolutionarily conserved set as those 1523 human
proteins that have strict orthologs in both D. melanogaster and C.
elegans.
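The "strict ortholog" criteria described above (bidirectional best hits that also outscore any within-organism paralog) can be illustrated with a minimal Python sketch. This is not the pipeline used for the analysis: the function names, the tuple layout `(query, subject, evalue, score)`, the default cutoff of `1e-10`, and the toy hit lists are all assumptions made for illustration.

```python
def best_hits(hits, max_evalue=1e-10):
    """Return {query: (subject, score)}, keeping the top-scoring hit per query.

    `hits` is an iterable of (query, subject, evalue, score) tuples; hits
    whose E-value exceeds `max_evalue` (an assumed cutoff) are ignored.
    """
    best = {}
    for query, subject, evalue, score in hits:
        if evalue > max_evalue:
            continue
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return best


def best_paralog_scores(self_hits):
    """Return {gene: best score against a different gene of the same organism}."""
    best = {}
    for query, subject, _evalue, score in self_hits:
        if query != subject:  # skip self-hits
            best[query] = max(best.get(query, 0.0), score)
    return best


def strict_orthologs(a_vs_b, b_vs_a, a_vs_a=(), b_vs_b=(), max_evalue=1e-10):
    """Pairs (a, b) that are reciprocal best hits and outscore all paralogs."""
    best_ab = best_hits(a_vs_b, max_evalue)
    best_ba = best_hits(b_vs_a, max_evalue)
    para_a = best_paralog_scores(a_vs_a)
    para_b = best_paralog_scores(b_vs_b)
    pairs = set()
    for a, (b, score_ab) in best_ab.items():
        back = best_ba.get(b)
        if back is None or back[0] != a:
            continue  # not a bidirectional best hit
        score_ba = back[1]
        # Require the cross-species hit to beat the best paralog on both sides.
        if score_ab > para_a.get(a, 0.0) and score_ba > para_b.get(b, 0.0):
            pairs.add((a, b))
    return pairs


def conserved_set(human_fly_pairs, human_worm_pairs):
    """Human genes with a strict ortholog in both other organisms."""
    return {h for h, _ in human_fly_pairs} & {h for h, _ in human_worm_pairs}


# Toy data (hypothetical gene identifiers and scores).
human_vs_fly = [("h1", "f1", 1e-50, 200.0), ("h2", "f2", 1e-30, 120.0),
                ("h3", "f2", 1e-20, 90.0)]
fly_vs_human = [("f1", "h1", 1e-50, 198.0), ("f2", "h2", 1e-30, 118.0)]
human_vs_human = [("h2", "h3", 1e-80, 300.0), ("h3", "h2", 1e-80, 300.0)]
human_vs_worm = [("h1", "w1", 1e-40, 150.0)]
worm_vs_human = [("w1", "h1", 1e-40, 149.0)]

hf = strict_orthologs(human_vs_fly, fly_vs_human, human_vs_human)
hw = strict_orthologs(human_vs_worm, worm_vs_human, human_vs_human)
core = conserved_set(hf, hw)
```

In the toy data, `h2` and `f2` are reciprocal best hits but `h2` has a closer human paralog (`h3`), so the pair is excluded. This mirrors the reason the strict count is described as a lower bound on the number of orthologs.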

Fig. 16. Functions of putative orthologs across vertebrate and
invertebrate genomes. Each slice lists the number and percentages (in
parentheses) of "strict orthologs" between the human, fly, and worm
genomes involved in a given category of molecular function. "Strict
orthologs" are defined here as bi-directional BLAST best hits (180)
such that each orthologous pair (i) has a BLASTP P-value of ≤10^-10
(120), and (ii) has a more significant BLASTP score than any paralogs
in either organism, i.e., there has likely been no duplication
subsequent to speciation that might make the orthology ambiguous. This
measure is quite strict and is a lower bound on the number of
orthologs. By these criteria, there are 2758 strict human-fly
orthologs and 2031 human-worm orthologs (1523 in common between these
sets).

The distribution of the functions of the conserved protein set is
shown in Fig. 16. Comparison with Fig. 15 shows that, not
surprisingly, the set of conserved proteins is not distributed among
molecular functions in the same way as the whole human protein set.
Compared with the whole human set (Fig. 15), there are several
categories that are overrepresented in the conserved set by a factor
of ~2 or more. The first category is nucleic acid enzymes, primarily
the transcriptional machinery (notably DNA/RNA methyltransferases,
DNA/RNA polymerases, helicases, DNA ligases, DNA- and RNA-processing
factors, nucleases, and ribosomal proteins). The basic transcriptional
and translational machinery is well known to have been conserved over
evolution, from bacteria through to the most complex eukaryotes. Many
ribonucleoproteins involved in RNA splicing also appear to be
conserved among the animals. Other enzyme types are also
overrepresented (transferases, oxidoreductases, ligases, lyases, and
isomerases). Many of these enzymes are involved in intermediary
metabolism. The only exception is the hydrolase category, which is not
significantly overrepresented in the shared protein set. Proteases
form the largest part of this category, and several large protease
families have expanded in each of these three organisms after their
divergence. The category of select regulatory molecules is also
overrepresented in the conserved set. The major conserved families are
small guanosine triphosphatases (GTPases) (especially the Ras-related
superfamily, including ADP ribosylation factor) and cell cycle
regulators (particularly the cullin family, cyclin C family, and
several cell division protein kinases). The last two significantly
overrepresented categories are protein transport and trafficking, and
chaperones. The most conserved groups in these categories are proteins
involved in coated vesicle-mediated transport, and chaperones involved
in protein folding and heat-shock response [particularly the DNAJ
family, and heat-shock protein 60 (HSP60), HSP70, and HSP90 families].
These observations provide only a conservative estimate of the protein
families in the context of specific cellular processes that were
likely derived from the last common ancestor of the human, fly, and
worm. As stated before, this analysis does not provide a complete
estimate of conservation across the three animal genomes, as
paralogous duplication makes the determination of true orthologs
difficult within the members of conserved protein families.
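The statement that a category is "overrepresented in the conserved set by a factor of ~2 or more" amounts to comparing a category's share of the conserved set with its share of the whole protein set. The sketch below shows one way such a ratio could be computed. The function name and the category counts are hypothetical placeholders, not values from Fig. 15 or Fig. 16.

```python
def overrepresentation(conserved_counts, whole_counts):
    """Return {category: ratio of its share in the conserved set to its share overall}.

    A ratio of 2.0 means the category makes up twice as large a fraction of
    the conserved set as of the whole set. Categories absent from the whole
    set are skipped to avoid division by zero.
    """
    total_conserved = sum(conserved_counts.values())
    total_whole = sum(whole_counts.values())
    ratios = {}
    for category, n_conserved in conserved_counts.items():
        n_whole = whole_counts.get(category, 0)
        if n_whole == 0:
            continue
        # (n_conserved / total_conserved) / (n_whole / total_whole)
        ratios[category] = (n_conserved * total_whole) / (n_whole * total_conserved)
    return ratios


# Hypothetical counts for three placeholder categories.
whole = {"category_A": 100, "category_B": 300, "category_C": 600}
conserved = {"category_A": 40, "category_B": 30, "category_C": 30}

ratios = overrepresentation(conserved, whole)
enriched = sorted(c for c, r in ratios.items() if r >= 2.0)
```

With these placeholder counts, only `category_A` passes a twofold threshold. A category can lose share in the conserved set (ratio below 1) even when many of its members are conserved, which is how lineage-specific expansions such as the protease families would appear in a comparison of this kind.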



7.3 Differences between the human genome and other sequenced
eukaryotic genomes

To explore the molecular building blocks of the vertebrate taxon, we
have compared the human genome with the other sequenced eukaryotic
genomes at three levels: molecular functions, protein families, and
protein domains.

Molecular differences can be correlated with phenotypic differences to
begin to reveal the developmental and cellular processes that are
unique to the vertebrates. Tables 18 and 19 display a comparison among
all sequenced eukaryotic genomes, over selected protein/domain
families (defined by sequence similarity, e.g., the serine-threonine
protein kinases) and superfamilies (defined by shared molecular
function, which may include several sequence-related families, e.g.,
the cytokines). In these tables we have focused on (super) families
that are either very large or that differ significantly in humans
compared with the other sequenced eukaryote genomes. We have found
that the most prominent human expansions are in proteins involved in
(i) acquired immune functions; (ii) neural development, structure, and
functions; (iii) intercellular and intracellular signaling pathways in
development and homeostasis; (iv) hemostasis; and (v) apoptosis.
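Table 18 reports, for each Pfam domain and organism, the number of proteins containing the domain and, in parentheses, the total number of domain occurrences, after applying an E value cutoff of 0.001. The following sketch shows how counts of that shape could be derived from a list of domain hits. The input layout `(protein_id, domain_name, evalue)`, the function names, the toy hits, and the choice to omit the parenthesized total when it equals the protein count are assumptions for illustration only.

```python
from collections import defaultdict


def domain_counts(hits, max_evalue=0.001):
    """Return {domain: (n_proteins, n_domain_occurrences)} for hits passing the cutoff.

    `hits` is an iterable of (protein_id, domain_name, evalue), one tuple per
    domain occurrence found in a protein.
    """
    proteins = defaultdict(set)
    occurrences = defaultdict(int)
    for protein_id, domain, evalue in hits:
        if evalue > max_evalue:
            continue  # discard weak matches
        proteins[domain].add(protein_id)
        occurrences[domain] += 1
    return {d: (len(proteins[d]), occurrences[d]) for d in occurrences}


def format_count(n_proteins, n_domains):
    """Render counts as 'proteins(domains)', or just 'proteins' when equal (assumed convention)."""
    if n_proteins == n_domains:
        return str(n_proteins)
    return f"{n_proteins}({n_domains})"


# Toy hits with hypothetical protein identifiers.
toy_hits = [
    ("p1", "Cadherin", 1e-20),
    ("p1", "Cadherin", 1e-15),  # second Cadherin repeat in the same protein
    ("p2", "Cadherin", 1e-5),
    ("p2", "Sema", 0.5),        # fails the 0.001 cutoff
    ("p3", "Sema", 1e-8),
]

counts = domain_counts(toy_hits)
cells = {d: format_count(*c) for d, c in counts.items()}
```

Repeat-rich domains show a large gap between the two numbers because a single protein can carry many copies, whereas single-copy domains give identical values.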

Acquired immunity. One of the most striking differences between the
human genome and the Drosophila or C. elegans genome is the appearance
of genes involved in acquired immunity (Tables 18 and 19). This is
expected, because the acquired immune response is a defense system
that only occurs in vertebrates. We observe 22 class I and 22 class II
major histocompatibility complex (MHC) antigen genes and 114 other
immunoglobulin genes in the human genome. In addition, there are 59
genes in the cognate immunoglobulin receptor family. At the domain
level, this is exemplified by an expansion and recruitment of the
ancient immunoglobulin fold to constitute molecules such as MHC, and
of the integrin fold to form several of the cell adhesion molecules
that mediate interactions between immune effector cells and the
extracellular matrix. Vertebrate-specific proteins include the
paracrine immune regulators family of secreted 4-alpha helical bundle
proteins, namely the cytokines and chemokines. Some of the cytoplasmic
signal transduction components associated with cytokine receptor
signal transduction are also features that are poorly represented in
the fly and worm. These include protein domains found in the signal
transducer and activator of transcription (STATs), the suppressors of
cytokine signaling (SOCS), and protein inhibitors of activated STATs
(PIAS). In contrast, many of the animal-specific protein domains that
play a role in innate immune response, such as the Toll receptors, do
not appear to be significantly expanded in the human genome.

Neural development, structure, and function. In the human genome, as
compared with the worm and fly genomes, there is a marked increase in
the number of members of protein families that are involved in neural
development. Examples include neurotrophic factors such as ependymin,
nerve growth factor, and signaling molecules such as semaphorins, as
well as the number of proteins involved directly in neural structure
and function such as myelin proteins, voltage-gated ion channels, and
synaptic proteins such as synaptotagmin. These observations correlate
well with the known phenotypic differences between the nervous systems
of these taxa, notably (i) the increase in the number and connectivity
of neurons; (ii) the increase in number of distinct neural cell types
(as many as a thousand or more in human compared with a few hundred in
fly and worm) (121); (iii) the increased length of individual axons;
and (iv) the significant increase in glial cell number, especially the
appearance of myelinating glial cells, which are electrically inert
supporting cells differentiated from the same stem cells as neurons. A
number of prominent protein expansions are involved in the processes
of neural development. Of the extracellular domains that mediate cell
adhesion, the connexin domain-containing proteins (122) exist only in
humans. These proteins, which are not present in the Drosophila or C.
elegans genomes, appear to provide the constitutive subunits of
intercellular channels and the structural basis for electrical
coupling. Pathway finding by axons and neuronal network formation is
mediated through a subset of ephrins and their cognate receptor
tyrosine kinases that act as positional labels to establish
topographical projections (123). The probable biological role for the
semaphorins (22 in human compared with 6 in the fly and 2 in the worm)
and their receptors (neuropilins and plexins) is that of axonal
guidance molecules (124). Signaling molecules such as neurotrophic
factors and some cytokines have been shown to regulate neuronal cell
survival, proliferation, and axon guidance (125). Notch receptors and
ligands play important roles in glial cell fate determination and
gliogenesis (126).

Other human expanded gene families play key roles directly in neural
structure and function. One example is synaptotagmin (expanded more
than twofold in humans relative to the invertebrates), originally
found to regulate synaptic transmission by serving as a Ca2+ sensor
(or receptor) during synaptic vesicle fusion and release (127). Of
interest is the increased co-occurrence in humans of PDZ and the SH3
domains in neuronal-specific adaptor molecules; examples include
proteins that likely modulate channel activity at synaptic junctions
(128). We also noted expansions in several ion-channel families (Table
19), including the EAG subfamily (related to cyclic nucleotide gated
channels), the voltage-gated calcium/sodium channel family, the
inward-rectifier potassium channel family, and the voltage-gated
potassium channel, alpha subunit family. Voltage-gated sodium and
potassium channels are involved in the generation of action potentials
in neurons. Together with voltage-gated calcium channels, they also
play a key role in coupling action potentials to neurotransmitter
release, in the development of neurites, and in short-term memory. The
recent observation of a calcium-regulated association between sodium
channels and synaptotagmin may have consequences for the establishment
and regulation of neuronal excitability (129).

Myelin basic protein and myelin-associated glycoprotein are major
classes of protein components in both the central and peripheral
nervous system of vertebrates. Myelin P0 is a major component of
peripheral myelin, and myelin proteolipid and myelin oligodendrocyte
glycoprotein are found in the central nervous system. Mutations in any
of these

Table 18. Domain-based comparative analysis of proteins in H. sapiens
(H), D. melanogaster (F), C. elegans (W), S. cerevisiae (Y), and A.
thaliana (A). The predicted protein set of each of the above
eukaryotic organisms was analyzed with Pfam version 5.5 using E value
cutoffs of 0.001. The number of proteins containing the specified Pfam
domains as well as the total number of domains (in parentheses) are
shown in each column. Domains were categorized into cellular processes
for presentation. Some domains (i.e., SH2) are listed in more than one
cellular process. Results of the Pfam analysis may differ from ...
limitations of large-scale automatic classifications. Representative
examples ... E value cutoff used for this analysis are marked with a
double asterisk (**). Examples include short ... alpha-helical
domains, and certain classes of cysteine-rich zinc finger proteins.



Accession  Domain name       Domain description
number

Developmental and homeostatic regulators

PF02039    Adrenomedullin    Adrenomedullin
PF00212    ANP               Atrial natriuretic peptide
PF00028    Cadherin          Cadherin domain
PF00214    Calc_CGRP_IAPP    Calcitonin/CGRP/IAPP family
PF01110    CNTF              Ciliary neurotrophic factor
PF01093    Clusterin         Clusterin
PF00029    Connexin          Connexin
PF00976    ACTH_domain       Corticotropin ACTH domain
PF00473    CRF               Corticotropin-releasing factor family
PF00007    Cys_knot          Cystine-knot domain
PF00778    DIX               Dix domain
PF00322    Endothelin        Endothelin family
PF00812    Ephrin            Ephrin
PF01404    EPh_lbd           Ephrin receptor ligand binding domain
PF00167    FGF               Fibroblast growth factor
PF01534    Frizzled          Frizzled/Smoothened family membrane region
PF00236    Hormone6          Glycoprotein hormones
PF01153    Glypican          Glypican
PF01271    Granin            Granin (chromogranin or secretogranin)
PF02058    Guanylin          Guanylin precursor
PF00049    Insulin           Insulin/IGF/Relaxin family
PF00219    IGFBP             Insulin-like growth factor binding proteins
PF02024    Leptin            Leptin
PF00193    Xlink             LINK (hyaluron binding)
PF00243    NGF               Nerve growth factor family
PF02158    Neuregulin        Neuregulin family
PF00184    Hormone5          Neurohypophysial hormones
PF02070    NMU               Neuromedin U
PF00066    Notch             Notch (DSL) domain
PF00865    Osteopontin       Osteopontin
PF00159    Hormone3          Pancreatic hormone peptides
PF01279    Parathyroid       Parathyroid hormone family
PF00123    Hormone2          Peptide hormone
PF00341    PDGF              Platelet-derived growth factor (PDGF)
PF01403    Sema              Sema domain
PF01033    Somatomedin_B     Somatomedin B domain
PF00103    Hormone           Somatotropin
PF02208    Sorb              Sorbin homologous domain
PF02404    SCF               Stem cell factor
PF01034    Syndecan          Syndecan domain
PF00020    TNFR_c6           TNFR/NGFR cysteine-rich region
PF00019    TGF-β             Transforming growth factor β-like domain
PF01099    Uteroglobin       Uteroglobin family
PF01160    Opiods_neuropep   Vertebrate endogenous opioids neuropeptide
PF00110    wnt               Wnt family of developmental signaling proteins

Hemostasis

Accession  Domain name       Domain description                       H
number

PF01821    ANATO             Anaphylotoxin-like domain                6(14)
PF00386    C1q               C1q domain                               24
PF00200    Disintegrin       Disintegrin                              18
PF00754    F5_F8_type_C      F5/8 type C domain                       15(20)
PF01410    COLFI             Fibrillar collagen C-terminal domain     10
PF00039    Fn1               Fibronectin type I domain                5(18)
PF00040    Fn2               Fibronectin type II domain               11(16)
PF00051    Kringle           Kringle domain                           15(24)
PF01823    MACPF             MAC/Perforin domain                      ?
PF00354    Pentaxin          Pentaxin family                          ?
PF00277    SAA_proteins      Serum amyloid A protein                  4
PF00084    Sushi             Sushi domain (SCR repeat)                53(191)
PF02210    TSPN              Thrombospondin N-terminal-like domains   14
PF01108    Tissue_fac        Tissue factor                            1
PF00868    Transglutamin_N   Transglutaminase family                  5
PF00927    Transglutamin_C   Transglutaminase family                  3


regulators 

1 

2 

. 100(550) 
3 

1 . 
3 

. .14(16) 
1 
2 

10(11) 
5 
3 

7(8) 
12 
23 
9 
1 
14 
3 
1 
7 
10 

13(23) 
3 
4 
1 
1 

-3(5) 
1 
3 
2 

5(9) 
5."- 
27(29) 
5(8) 
1 
2 
2 
3 

17(31) 
27(28) 
3 
3 

ia 



r 



0 
0 

14(157) 
0 
0 

:' 0 • 
' 0 
0 
1 
2 
2 
0 
2 
2 
1 

7 ■ 

0 

2 

0 

0 

4 

0 

0 

0 

■o" 
0 
0 
0 

2(4) 
0 

• 0 
0 
0 

1 

8(10) 

3 
. 0 
-0 

0 

1 

1 
6 
0 
0 

7(10) 



0 
0 

16(66) 
0 
0 
0 
0* 
0 
0 
0 
4 
0 
4 
1 
1 
3 

0 
1 

0 

0 

0 

0 

0 

1 

0 • 

' ' 0 
. 0 
0 

2(6) 
0 
0 
0 
0 
0 

3(4) 
0 
0 
0 
0 

1 

0 
4 
0 
0 
5 



0 
0 
0 
0 

0 

0. 

0 

0 

0 

0 

0 

0 

0 

0 
0 
0 
0 



0 
0 
0 
0 
0 

o 

0 ■ 

. 0 

o 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 



0 
0 
0 
0 
0 

o 

0.' 
0' 
0 

o 

0 

0 

0 

0 

0 

0 

0 

0 

0 
0 
. 0 
0 
0 
- 0 
. 0 

b 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 



0 


0 


0 


0 


0 


0 


0 


0 


2 


3 


0 


0 


5(6) 


2 


0 


0 


. 0 


0 ' 


0- 


0 


. 0 


' 0 


.0 . 


0 


6 


0 


0. 


•0 


2 


2 . '* 


0 


0 


0 


0 


0 


0 


0 


0 • 


0 


0 


0 


0 


0 


0 


11(42) 


. 8(45) 


0 


0 


1 


0 


0 


0 


0 


0 


0 


0 


1 


0 


0 


0 


1 


0 


0 


0 



38 
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Accession 
number 



Domain name 



- Domain description 



W 



Accession  Domain name       Domain description
number

PF00594    Gla               Vitamin K-dependent carboxylation/gamma-carboxyglutamic (GLA) domain

Immune response

PF00711    Defensin_beta     Beta defensin
PF00748    Calpain_inhib     Calpain inhibitor repeat
PF00666    Cathelicidins     Cathelicidins
PF00129    MHC_I             Class I histocompatibility antigen, domains alpha 1 and 2
PF00993    MHC_II_alpha**    Class II histocompatibility antigen, alpha domain
PF00969    MHC_II_beta**     Class II histocompatibility antigen, beta domain
PF00879    Defensin_propep   Defensin propeptide
PF01109    GM_CSF            Granulocyte-macrophage colony-stimulating factor
PF00047    ig                Immunoglobulin domain
PF00143    Interferon        Interferon alpha/beta domain
PF00714    IFN-gamma         Interferon gamma
PF00726    IL10              Interleukin-10
PF02372    IL15              Interleukin-15
PF00715    IL2               Interleukin-2
PF00727    IL4               Interleukin-4
PF02025    IL5               Interleukin-5
PF01415    IL7               Interleukin-7/9 family
PF00340    IL1               Interleukin-1
PF02394    IL1_propep        Interleukin-1 propeptide
PF02059    IL3               Interleukin-3
PF00489    IL6               Interleukin-6/G-CSF/MGF family
PF01291    LIF_OSM           Leukemia inhibitory factor (LIF)/oncostatin (OSM) family
PF00323    Defensins         Mammalian defensin
PF01091    PTN_MK            PTN/MK heparin-binding protein
PF00277    SAA_proteins      Serum amyloid A protein
PF00048    IL8               Small cytokines (intecrine/chemokine), interleukin-8 like
PF01582    TIR               TIR domain
PF00229    TNF               TNF (tumor necrosis factor) family
PF00088    Trefoil           Trefoil (P-type) domain

PI-PY-rho GTPase signaling

PF00779    BTK               BTK motif
PF00168    C2                C2 domain
PF00609    DAGKa             Diacylglycerol kinase accessory domain (presumed)
PF00781    DAGKc             Diacylglycerol kinase catalytic domain (presumed)
PF00610    DEP               Domain found in Dishevelled, Egl-10, and Pleckstrin (DEP)
PF01363    FYVE              FYVE zinc finger
PF00996    GDI               GDP dissociation inhibitor
PF00503    G-alpha           G-protein alpha subunit
PF00631    G-gamma           G-protein gamma like domains
PF00616    RasGAP            GTPase-activator protein for Ras-like GTPase
PF00618    RasGEFN           Guanine nucleotide exchange factor for Ras-like GTPases; N-terminal motif
PF00625    Guanylate_kin     Guanylate kinase
PF02189    ITAM              Immunoreceptor tyrosine-based activation motif
PF00169    PH                PH domain
PF00130    DAG_PE-bind       Phorbol esters/diacylglycerol binding domain (C1 domain)
PF00388    PI-PLC-X          Phosphatidylinositol-specific phospholipase C, X domain
PF00387    PI-PLC-Y          Phosphatidylinositol-specific phospholipase C, Y domain
PF00640    PID               Phosphotyrosine interaction domain (PTB/PID)
PF02192    PI3K_p85B         PI3-kinase family, p85-binding domain
PF00794    PI3K_rbd          PI3-kinase family, ras-binding domain
PF01412    ArfGAP            Putative GTP-ase activating protein for Arf
PF02196    RBD               Raf-like Ras-binding domain
PF02145    Rap_GAP           Rap/ran-GAP
PF00788    RA                Ras association (RalGDS/AF-6) domain
PF00071    Ras               Ras family
PF00617    RasGEF            RasGEF domain
PF00615    RGS               Regulator of G protein signaling domain
PF02197    RIIa              Regulatory subunit of type II PKA R-subunit


11 



. 3(9) 
• 2 
; 18(20) 

5(6) 
7 
3 
1 

381 (930J 
7(9) 



2 
2 



24(27) 
" 2 
6 
16 
6(7) 
5 

18(19) 
126 
2.1 
27 
4 



0 
0 
0 

0 

o 

0 
0 

125(291) 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
O 
0 



0 
0 

0. 
0 

0 
0 
0 
0 

67(323) 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 

o 

0 
0 



13 
1 
3 
9 
4 
4 

7(9) 
56(57) 
8 

6(7) 
1 



11(12) 
1 
1 
8 
1 
2 
6 
51 

12(13) 
2 



0 
- 0 
0 

.; b 

0 
0 
0 
o 
0 
0 
0 
0 
0 

0 

0 

0 

0 

0 

0 

0 

0 

0 



2 


0 


0 


0 


2 


0 


0 


0 


4 


0 


0 


0 


32 


0 


0 


0 


18 


8 


■ 2- 


b 


'12 " 


0 


0 


0 


5(6) 


0 


2' 


0 


5 


1 


.0 


0 


73(101) ' 


32(44) 


24(35) 


6(9) 


9 


4 


7 


0 


10 


8 


8 


2 


12(13) 


4 


10 


5 


28 (30) 


14 


15 


5 


6 


2 




1 


27(30) 


10 


20(23) 


2 


16 


•5 . 


5 


1 


11 


5 


8 


3 


9 


2 


3 


5 


.12 


8 


7 


1 


3 


0 


0 


0 


?3(212) 


72(78) 


65(68) 


24 


45(56) : 


25(31) 


26(40) 


1(2) 


12 


3 


7 


1 


11 . 


2 


7 


1 



0 
0 
0 

'6 
0 
0 
1 

23 
5 
1 
1 



0 
0 

: 0. 
■ 0. 

• ■ 6 
0 
0 
0 

Q 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 
0 
0 
0 



131 (143) 
0 
0 

0 

66 (90) 
6 

11(12) 
2 

15 
3 
5 
0 
0 
0 

4 
0 
23 
4 



8 

0 
0 
0 
15 
0 
0 
0 
78 
0 
0 
0 
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Accession 
number . 



Domam name" 



Domain description 



W 



Accession  Domain name       Domain description
number

PF00620    RhoGAP            RhoGAP domain
PF00621    RhoGEF            RhoGEF domain
PF00536    SAM               SAM domain (Sterile alpha motif)
PF01369    Sec7              Sec7 domain
PF00017    SH2               Src homology 2 (SH2) domain
PF00018    SH3               Src homology 3 (SH3) domain
PF01017    STAT              STAT protein
PF00790    VHS               VHS domain
PF00568    WH1               WH1 domain

Domains involved in apoptosis

PF00452    Bcl-2
PF02180    BH4               Bcl-2 homology region 4
PF00619    CARD              Caspase recruitment domain
PF00531    Death             Death domain
PF01335    DED               Death effector domain
PF02179    BAG               Domain present in Hsp70 regulators
PF00656    ICE_p20           ICE-like protease (caspase) p20 domain
PF00653    BIR               Inhibitor of Apoptosis domain

Cytoskeletal

PF00022    Actin             Actin
PF00191    Annexin           Annexin
PF00402    Calponin          Calponin family
PF00373    Band_41           FERM domain (Band 4.1 family)
PF00880    Nebulin_repeat    Nebulin repeat
PF00681    Plectin_repeat    Plectin repeat
PF00435    Spectrin          Spectrin repeat
PF00418    Tubulin-binding   Tau and MAP proteins, tubulin-binding
PF00992    Troponin          Troponin
PF02209    VHP               Villin headpiece domain
PF01044    Vinculin          Vinculin family

ECM adhesion

PF01391    Collagen          Collagen triple helix repeat (20 copies)
PF01413    C4                C-terminal tandem repeated domain in type 4 procollagen
PF00431    CUB               CUB domain
PF00008    EGF               EGF-like domain
PF00147    Fibrinogen_C      Fibrinogen beta and gamma chains, C-terminal globular domain
PF00041    Fn3               Fibronectin type III domain
PF00757    Furin-like        Furin-like cysteine rich region
PF00357    Integrin_A        Integrin alpha cytoplasmic region
PF00362    Integrin_B        Integrins, beta chain
PF00052    Laminin_B         Laminin B (Domain IV)
PF00053    Laminin_EGF       Laminin EGF-like (Domains III and V)
PF00054    Laminin_G         Laminin G domain
PF00055    Laminin_Nterm     Laminin N-terminal (Domain VI)
PF00059    Lectin_c          Lectin C-type domain
PF01463    LRRCT             Leucine rich repeat C-terminal domain
PF01462    LRRNT             Leucine rich repeat N-terminal domain
PF00057    Ldl_recept_a      Low-density lipoprotein receptor domain class A
PF00058    Ldl_recept_b      Low-density lipoprotein receptor repeat class B
PF00530    SRCR              Scavenger receptor cysteine-rich domain
PF00084    Sushi             Sushi domain (SCR repeat)
PF00090    tsp_1             Thrombospondin type 1 domain
PF00092    Vwa               von Willebrand factor type A domain
PF00093    Vwc               von Willebrand factor type C domain
PF00094    Vwd               von Willebrand factor type D domain

Protein interaction domains

PF00244    14-3-3            14-3-3 proteins
PF00023    Ank               Ank repeat
PF00514    Armadillo_seg     Armadillo/beta-catenin-like repeats
PF00168    C2                C2 domain
PF00027    cNMP_binding      Cyclic nucleotide-binding domain
PF01556    DnaJ_C            DnaJ C terminal region
PF00226    DnaJ              DnaJ domain
PF00036    Efhand**          EF hand
PF00611    FCH               Fes/CIP4 homology domain
PF01846    FF                FF domain
PF00498    FHA               FHA domain



59 


19 


20 


9 


8 


46 


23 (24) 


18(19) 


3 


0 


29(31) 


15 


8 


3 


■■■ 6 


13 


5 . 


5 


5 


9 


. 87(95) 


33(39) ^ 


44(48) 


1 


3 


143 (182) 


55(75) 


46(61) 


23(27) 


4 


7 


1 




• 0 


0 


4 


2 


4 


4 


8 


7 


2 • 


.2(3) 


1 


0 


• 9 


2 . 


1 


0 


0 


' 3 


• 0 


1 


. 0 . 


.0 


■ 15 


0 


■2. 


0 ■ 


0 


16 


5 


7 


0 


0 


4(5) 


0 


0 


0 


0 


5(8) 


3 


2 


1 


5 


11 


7 


3 


0 


0 


8(14) 


5(9) 


2(3) 


1(2) 


0 



61(64) 
16(55) 
13(22) 
29(30) 
4(143) 
2(11) 
31 (195) 
4(12) 
4 
5 

4 

65(279)- 
.6(11) 

-47(69) 
108 (420) 
26 

106 (545) 
5- 
3 
8 

8(12) 
24(126) 
30(57) 
10 
47(76) 
69(81) 
40(44) 
35(127) 
15(96) 
.11(46) 
53(191) 
41(66). . 
34(58) 
19(28) 
15(35) 



15(16) 
4(16) 
3 

17(19) 
1(2) 
0 

13(171) 
• 1(4) 
6 
2 
2 

/ 10(46) ' 
2(4) 

9(47) 
45(186) 
10(11) 

42(168) 
2 
1 
2 

4(7) 
?(621 
18(42) 
6 

23(24) 
23(30) 
7(13) 
33(152) 
9(56) 
4(8) 
11(42) 
,11(23) 
0 . 
6(11) 
3(7) 



12 
4(11) 
7(19) 
11(14) 
1 
0 

10(93) 
2(8) 
8 
2 

1. 

. 174^384) 
3 (6) 

43(6?) 
54(157) 
6 

34(156) 
1 
2 
2 

6(10) 
11(65) 
14(26) • 
4 

91 (132) 
7(9) 
3(6) 
27(113) 
7(22) 
1(2) 
8(45) 
18(47) 
17(19) 
2(5) 
• 9 



9f11) 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 

. 0 
0 

0 
0 
0 

0 

0 

0 

0 

0 

0 

0 

0 

0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 



24 
6(16) 
0 
0 
0 
0 
0 
0 
0 
5 
0. 

0 . 
0 

o 
1 

0 

1 

0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 

0 .. 
"6 

0 
0 

1 

0 
0 



• 20 
145(404) 
22(56) 
73(101) 
26(31) 
. 12 
44 

83(151) 
9 

4(11) 
13 



3 


3 


2 


15 


72(269) 


75(223) 


12(20) 


66(111) 


11(38) . 


3(11) 


2(10) 


25(67) 


32(44) 


24(35) 


6(9) 


" 66(90) 


21(33) 


15(20) 


2(3) 


22 


9 


5 


3 


19 


34 


33 


20 


93 


64(117) 


41(86) 


4(11) 


• 120(328) 


3 




4 . 


0 


4(10) • 


3(16) 


2(5) 


4(8) 


15 


7 


13(14) 


17 
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myelin proteins result in severe demyelination, which is a pathological condition in which the myelin is lost and the nerve conduction is severely impaired (130). Humans have at least 10 genes belonging to four different families involved in myelin production (five myelin P0, three myelin proteolipid, myelin basic protein, and myelin-oligodendrocyte glycoprotein, or MOG), and possibly more remotely related members of the MOG family. Flies have only a single myelin proteolipid, and worms have none at all.



Intercellular and intracellular signaling pathways in development and homeostasis. Many protein families that have expanded in humans relative to the invertebrates are involved in signaling processes, particularly in response to development and differentiation



Table 18 (Continued)

Accession number | Domain name | Domain description | H | F | W | Y | A


PF00254 | FKBP | FKBP-type peptidyl-prolyl cis-trans isomerases | 15(20) | 7(8) | 7(13) | 4 | 24(29)
PF01590 | GAF | GAF domain | 7(8) | 2(4) | 1 | 0 | 10
PF01344 | Kelch | Kelch motif | 54(157) | 12(48) | 13(41) | 3 | 102(178)
PF00560 | [missing] | Leucine Rich Repeat | 25(30) | 24(30) | 7(11) | 1 | 15(16)
PF00917 | MATH | MATH domain | 11 | 5 | 88(161) | 1 | 61(74)
PF00989 | PAS | PAS domain | 18(19) | 9(10) | 6 | 1 | 13(18)
PF00595 | PDZ | PDZ domain (also known as DHR or GLGF) | 96(154) | 60(87) | 46(66) | 2 | 5
PF00169 | PH | PH domain | 193(212) | 72(78) | 65(68) | 24 | 23
PF01535 | PPR** | PPR repeat | 5 | 3(4) | 0 | 1 | 474(2485)
PF00536 | SAM | SAM domain (Sterile alpha motif) | 29(31) | 15 | 8 | 3 | 6
PF01369 | Sec7 | Sec7 domain | 13 | 5 | 5 | 5 | 9
PF00017 | SH2 | Src homology 2 (SH2) domain | 87(95) | 33(39) | 44(48) | 1 | 3
PF00018 | SH3 | Src homology 3 (SH3) domain | 143(182) | 55(75) | 46(61) | 23(27) | 4
PF01740 | STAS | STAS domain | 5 | 1 | 6 | 2 | 13


rruu^ 1^ 


1 ri\ 


TDD JamvIm 

TPrC Gomain 




ao rirti\ 


28 (54J 


16 13i; 




PF00400 | WD40** | WD40 domain | 136(305) | 98(226) | 72(153) | 56(121) | 167(344)
PF00397 | WW | WW domain | 32(53) | 24(39) | 16(24) | 5(8) | 11(15)
PF00569 | ZZ | ZZ-Zinc finger present in dystrophin, CBP/p300 | 10(11) | 13 | 10 | 2 | 10

Nuclear interaction domains

PF01754 | zf-A20 | A20-like zinc finger | 2(8) | 2 | 2 | 0 | 8
PF01383 | ARID | ARID DNA binding domain | 11 | 6 | 4 | 2 | [missing]
PF01426 | BAH | BAH domain | 8(10) | 7(8) | 4(5) | 5 | 21(25)
PF00643 | zf-B_box** | B-box zinc finger | 32(35) | 1 | 2 | 0 | 0
PF00533 | BRCT | BRCA1 C Terminus (BRCT) domain | 17(28) | 10(18) | 23(35) | 10(16) | 12(16)
PF00439 | Bromodomain | Bromodomain | 37(48) | 16(22) | 18(26) | 10(15) | 28
PF00651 | BTB | BTB/POZ domain | 97(98) | 62(64) | 86(91) | 1(2) | 30(31)
PF00145 | DNA_methylase | C-5 cytosine-specific DNA methylase | 3(4) | 1 | 0 | 0 | 13(15)
PF00385 | Chromo | 'chromo' (CHRromatin Organization MOdifier) domain | 24(27) | 14(15) | 17(18) | [illegible](2) | 12
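The count cells in Table 18 can be read mechanically. The following is a minimal sketch, not part of the original analysis: it assumes each row is written as pipe-separated fields (accession, domain name, description, then five organism columns), that a cell such as `143(182)` means 143 proteins carrying 182 domain copies, and that a bare number means the two counts are equal. The names `Count`, `parse_cell`, and `parse_row` are hypothetical helpers introduced only for illustration.

```python
import re
from typing import List, NamedTuple, Tuple


class Count(NamedTuple):
    proteins: int  # number of proteins containing the domain
    domains: int   # total number of domain copies


# One integer, optionally followed by a second integer in parentheses.
_CELL = re.compile(r"^\s*(\d+)\s*(?:\(\s*(\d+)\s*\))?\s*$")


def parse_cell(cell: str) -> Count:
    """Parse a cell such as '143(182)' or '4' into a Count."""
    match = _CELL.match(cell)
    if match is None:
        raise ValueError(f"unreadable count cell: {cell!r}")
    proteins = int(match.group(1))
    # Assumption: a bare number means one domain copy per protein.
    domains = int(match.group(2)) if match.group(2) is not None else proteins
    return Count(proteins, domains)


def parse_row(line: str) -> Tuple[str, str, str, List[Count]]:
    """Split a pipe-separated table row into its text fields and counts."""
    fields = [field.strip() for field in line.split("|")]
    accession, name, description = fields[:3]
    counts = [parse_cell(cell) for cell in fields[3:]]
    return accession, name, description, counts


row = "PF00018 | SH3 | Src homology 3 (SH3) domain | 143(182) | 55(75) | 46(61) | 23(27) | 4"
acc, name, desc, counts = parse_row(row)
print(acc, name, counts[0])
```

Cells that the scan left illegible do not match the pattern and raise `ValueError`, which keeps damaged values from being silently read as numbers.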



PF00125 | Histone | Core histone H2A/H2B/H3/H4 | 75(81) | 5 | 71(73) | 8 | 48
PF00134 | Cyclin | Cyclin | 19 | 10 | 10 | 11 | 35
PF00270 | DEAD | DEAD/DEAH box helicase | 63(66) | 48(50) | 55(57) | 50(52) | 84(87)
PF01529 | zf-DHHC | DHHC zinc finger domain | 15 | 20 | 16 | 7 | 22
PF00645 | F-box** | F-box domain | 16 | 15 | 309(324) | 9 | 165(167)
PF00250 | Fork_head | Fork head domain | 35(36) | 20(21) | 15 | 4 | 0
PF00320 | GATA | GATA zinc finger | 11(17) | 5(6) | 8(10) | 9 | 26
PF01585 | G-patch | G-patch domain | 18 | 16 | 13 | 4 | 14(15)
PF00010 | HLH** | Helix-loop-helix DNA-binding domain | 60(61) | 44 | 24 | 4 | 39
PF00850 | Hist_deacetyl | Histone deacetylase family | 12 | 5(6) | 8(10) | 5 | 10
PF00046 | Homeobox | Homeobox domain | 160(178) | 100(103) | 82(84) | 6 | 66
PF01833 | TIG | IPT/TIG domain | 29(53) | 11(13) | 5(7) | 2 | 1
PF02373 | JmjC | JmjC domain | 10 | 4 | 6 | 4 | 7
PF02375 | JmjN | JmjN domain | 7 | 4 | 2 | 3 | 7
PF00013 | KH-domain | KH domain | 28(67) | 14(32) | 17(46) | 4(14) | 27(61)
PF01352 | KRAB | KRAB box | 204(243) | 0 | 0 | 0 | 0
PF00104 | Hormone_rec | Ligand-binding domain of nuclear hormone receptor | 47 | 17 | 142(147) | 0 | 0
[missing] | LIM | LIM domain containing proteins | 62(129) | 33(83) | 33(79) | 4(7) | 10(16)
[missing] | MATH | MATH domain | 11 | 5 | 88(161) | 1 | 61(74)
[missing] | Myb_DNA-binding | Myb-like DNA-binding domain | 32(43) | 18(24) | 17(24) | 15(20) | 243(401)
[missing] | Myc-LZ | Myc leucine zipper domain | 1 | 0 | 0 | 0 | 0
[missing] | zf-MYND | MYND finger | 14 | 14 | 9 | 1 | 7
[missing] | PHD | PHD-finger | 68(86) | 40(53) | 32(44) | 14(15) | 96(105)
[missing] | Pou | Pou domain, N-terminal to homeobox domain | 15 | 5 | 4 | 0 | 0
[missing] | RFX_DNA_binding | RFX DNA-binding domain | 7 | 2 | 1 | 1 | 0
[missing] | Rrm | RNA recognition motif (a.k.a. RRM, RBD, or RNP domain) | 224(324) | 127(199) | 94(145) | 43(73) | 232(369)
[missing] | SAP | SAP domain | 15 | 8 | 5 | 5 | 6(7)
[missing] | SPRY | SPRY domain | 44(51) | 10(12) | 5(7) | 3 | 6
[missing] | START | START domain | 10 | 2 | 6 | 0 | 23
[missing] | T-box | T-box | 17(19) | 8 | 22 | 0 | 0
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Table 18 (Continued)

Accession
number

Domain name

Domain description

H

W

PF02135
PF01285
PF02176
PF00352
PF00567
PF00642
PF00096
PF00097
PF00098

zf-TAZ
TEA
zf-TRAF
TBP
TUDOR
zf-CCCH
zf-C2H2**
zf-C3HC4
zf-CCHC

TAZ finger
TEA domain
TRAF-type zinc finger
Transcription factor TFIID (or TATA-binding protein, TBP)
TUDOR domain
Zinc finger C-x8-C-x5-C-x3-H type (and similar)
Zinc finger, C2H2 type
Zinc finger, C3HC4 type (RING finger)
Zinc knuckle



2(3) 
4 

6(9) 
2(4) 

9(24). 
17(22) 
564(4500) 
135(137) 
9(17) 




1(2) 6(7) 0 

1 11 

1(3) 1 0 

4(8) . 2(4) 1(2) 

9(19)' 4(5) 0 2 

6(8) 22(42) 3(5) 31(46) 

234 (771) 68 (155) 34 (56) 21 (24) 

.Jl ^^(^^^ ^® 298(304) 

6(10) 17(33) 7(13) 68(91) 



(Tables 18 and 19). They include secreted hormones and growth factors, receptors, intracellular signaling molecules, and transcription factors.

Developmental signaling molecules that are enriched in the human genome include growth factors such as wnt, transforming growth factor-β (TGF-β), fibroblast growth factor (FGF), nerve growth factor, platelet derived growth factor (PDGF), and ephrins. These growth factors affect tissue differentiation and a wide range of cellular processes involving actin-cytoskeletal and nuclear regulation. The corresponding receptors of these developmental ligands are also expanded in humans. For example, our analysis suggests at least 8 human ephrin genes (2 in the fly, 4 in the worm) and 12 ephrin receptors (2 in the fly, 1 in the worm). In the wnt signaling pathway, we find 18 wnt family genes (6 in the fly, 5 in the worm) and 12 frizzled receptors (6 in the fly, 5 in the worm). The Groucho family of transcriptional corepressors downstream in the wnt pathway is even more markedly expanded, with 13 predicted members in humans (2 in the fly, 1 in the worm).
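The gene counts quoted in the paragraph above can be turned into fold expansions directly. The sketch below is illustrative only: the numbers are copied from the text, while the dictionary layout and the helper name `fold_expansion` are assumptions introduced for this example.

```python
# Gene counts per family as quoted in the text (human, fly, worm).
counts = {
    "ephrin": {"human": 8, "fly": 2, "worm": 4},
    "ephrin receptor": {"human": 12, "fly": 2, "worm": 1},
    "wnt": {"human": 18, "fly": 6, "worm": 5},
    "frizzled receptor": {"human": 12, "fly": 6, "worm": 5},
    "Groucho": {"human": 13, "fly": 2, "worm": 1},
}


def fold_expansion(family: str, reference: str) -> float:
    """Ratio of the human gene count to the count in a reference organism."""
    family_counts = counts[family]
    return family_counts["human"] / family_counts[reference]


for family in counts:
    print(
        f"{family}: {fold_expansion(family, 'fly'):.1f}x vs fly, "
        f"{fold_expansion(family, 'worm'):.1f}x vs worm"
    )
```

On these figures the Groucho family shows the largest ratio relative to the worm, which is what the text means by "even more markedly expanded".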

Extracellular adhesion molecules involved in signaling are expanded in the human genome (Tables 18 and 19). The interactions of several of these adhesion domains with extracellular matrix proteoglycans play a critical role in host defense, morphogenesis, and tissue repair (131). Consistent with the well-defined role of heparan sulfate proteoglycans in modulating these interactions (132), we observe an expansion of the heparin sulfate sulfotransferases in the human genome relative to worm and fly. These sulfotransferases modulate tissue differentiation (133). A similar expansion in humans is noted in structural proteins that constitute the actin-cytoskeletal architecture. Compared with the fly and worm, we observe an explosive expansion of the nebulin (35 domains per protein on average), aggrecan (12 domains per protein on average), and plectin (5 domains per protein on average) repeats in humans. These repeats are present in proteins involved in modulating the actin-cytoskeleton with predominant expression in neuronal, muscle, and vascular tissues.



Comparison across the five sequenced eukaryotic organisms revealed several expanded protein families and domains involved in cytoplasmic signal transduction (Table 18). In particular, signal transduction pathways playing roles in developmental regulation and acquired immunity were substantially enriched. There is a factor of 2 or greater expansion in humans in the Ras superfamily GTPases and the GTPase activator and GTP exchange factors associated with them. Although there are about the same number of tyrosine kinases in the human and C. elegans genomes, in humans there is an increase in the SH2, PTB, and ITAM domains involved in phosphotyrosine signal transduction. Further, there is a twofold expansion of phosphodiesterases in the human genome compared with either the worm or fly genomes.

The downstream effectors of the intracellular signaling molecules include the transcription factors that transduce developmental fates. Significant expansions are noted in the ligand-binding nuclear hormone receptor class of transcription factors compared with the fly genome, although not to the extent observed in the worm (Tables 18 and 19). Perhaps the most striking expansion in humans is in the C2H2 zinc finger transcription factors. Pfam detects a total of 4500 C2H2 zinc finger domains in 564 human proteins, compared with 771 in 234 fly proteins. This means that there has been a dramatic expansion not only in the number of C2H2 transcription factors, but also in the number of these DNA-binding motifs per transcription factor (8 on average in humans, 3.3 on average in the fly, and 2.3 on average in the worm). Furthermore, many of these transcription factors contain either the KRAB or SCAN domains, which are not found in the fly or worm genomes. These domains are involved in the oligomerization of transcription factors and increase the combinatorial partnering of these factors. In general, most of the transcription factor domains are shared between the three animal genomes, but the reassortment of these domains results in organism-specific transcription factor families. The domain combinations found in the human, fly, and worm include the BTB with C2H2 in the fly and humans, and homeodomains alone or in combination with Pou and LIM domains in all of the animal genomes. In plants, however, a different set of transcription factors are expanded, namely, the myb family, and a unique set that includes VP1 and AP2 domain-containing proteins (134). The yeast genome has a paucity of transcription factors compared with the multicellular eukaryotes, and its repertoire is limited to the expansion of the yeast-specific C6 transcription factor family involved in metabolic regulation.
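The per-protein averages quoted for the C2H2 zinc finger follow from dividing domain counts by protein counts. The sketch below reproduces that arithmetic. The human and fly figures are stated in the text; the worm figures (155 domains in 68 proteins) are taken from the zf-C2H2 row of Table 18, on the assumption that the second value in that row is the worm column. The function name `domains_per_protein` is hypothetical.

```python
# (proteins, domains) for the C2H2 zinc finger.
# Human and fly values are quoted in the text; the worm values are read
# from the zf-C2H2 row of Table 18 (assumed column assignment).
c2h2 = {
    "human": (564, 4500),
    "fly": (234, 771),
    "worm": (68, 155),
}


def domains_per_protein(proteins: int, domains: int) -> float:
    """Average number of domain copies per domain-containing protein."""
    return domains / proteins


averages = {
    organism: round(domains_per_protein(proteins, domains), 1)
    for organism, (proteins, domains) in c2h2.items()
}
print(averages)
```

The rounded results agree with the averages given in the text (8, 3.3, and 2.3), which supports reading the table cells as protein count followed by domain count in parentheses.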
While we have illustrated expansions in a subset of signal transduction molecules in the human genome compared with the other eukaryotic genomes, it should be noted that most of the protein domains are highly conserved. An interesting observation is that worms and humans have approximately the same number of both tyrosine kinases and serine/threonine kinases (Table 19). It is important to note, however, that these are merely counts of the catalytic domain; the proteins that contain these domains also display a wide repertoire of interaction domains with significant combinatorial diversity.
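The statement that worms and humans have approximately the same number of kinases can be checked against the kinase rows of Table 19. This is a rough sketch under an explicit assumption: the five values in each row are taken to be in the order human, fly, worm, yeast, Arabidopsis, which the scanned table does not state on those rows. The variable and function names are hypothetical.

```python
# Kinase rows from Table 19.
# Assumed column order: human, fly, worm, yeast, Arabidopsis.
ORGANISMS = ("human", "fly", "worm", "yeast", "arabidopsis")
kinase_rows = {
    "S/T and dual-specificity protein kinase": (395, 198, 315, 114, 1102),
    "Y protein kinase": (106, 47, 100, 5, 16),
}


def human_to_worm_ratio(values: tuple) -> float:
    """Human count divided by worm count for one table row."""
    by_organism = dict(zip(ORGANISMS, values))
    return by_organism["human"] / by_organism["worm"]


ratios = {
    family: round(human_to_worm_ratio(values), 2)
    for family, values in kinase_rows.items()
}
print(ratios)
```

Under that column assumption both ratios stay close to 1, well below the twofold or greater expansions reported for other signaling families.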

Hemostasis. Hemostasis is regulated primarily by plasma proteases of the coagulation pathway and by the interactions that occur between the vascular endothelium and platelets. Consistent with known anatomical and physiological differences between vertebrates and invertebrates, extracellular adhesion domains that constitute proteins integral to hemostasis are expanded in the human relative to the fly and worm (Tables 18 and 19). We note the evolution of domains such as FIMAC, FN1, FN2, and C1q that mediate surface interactions between hematopoietic cells and the vascular matrix. In addition, there has been extensive recruitment of more-ancient animal-specific domains such as VWA, VWC, VWD, kringle, and FN3 into multidomain proteins that are involved in hemostatic regulation. Although we do not find a large expansion in the total number of serine proteases, this enzymatic domain has been specifically recruited into several of these multidomain proteins for proteolytic regulation in the vascular compartment. These are represented in plasma proteins that belong to the kinin and complement pathways. There is a
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significant expansion in two families of matrix metalloproteases: ADAM (a disintegrin and metalloprotease) and MMPs (matrix metalloproteases) (Table 19). Proteolysis of extracellular matrix (ECM) proteins is critical for tissue development and for tissue degradation in diseases such as cancer, arthritis, Alzheimer's disease, and a variety of inflammatory conditions (135, 136). ADAMs are a family of integral membrane proteins with a pivotal role in fibrinogenolysis and modulating interactions between hematopoietic components and the vascular matrix components. These proteins have been shown to cleave matrix proteins and even signaling molecules: ADAM-17 converts tumor necrosis factor-α, and ADAM-10 has been implicated in the Notch signaling pathway (135). We have identified 19 members of the matrix metalloprotease family, and a total of 51 members of the ADAM and ADAM-TS families.

Apoptosis. Evolutionary conservation of some of the apoptotic pathway components across eukarya is consistent with its central role in developmental regulation and as a response to pathogens and stress signals. The signal transduction pathways involved in programmed cell death, or apoptosis, are mediated by interactions between well-characterized domains that include extracellular domains, adaptor (protein-protein interaction) domains, and those found in effector and



The Human Genome

in diverse human pathology ranging from allergic responses to cancers. One of the most surprising human expansions, however, is in the number of glyceraldehyde-3-phosphate dehydrogenase (GAPDH) genes (46 in humans, 3 in the fly, and 4 in the worm). There is, however, evidence for many retrotransposed GAPDH pseudogenes (139), which may account for this apparent expansion. However, it is interesting that GAPDH, long known as a conserved enzyme involved in basic metabolism found across all phyla from bacteria to humans, has recently been shown to have other functions. It has a second cat-



Panther family/subfamily*



W



Ependymin
Ion channels
Acetylcholine receptor

Amiloride-sensitive/degenerin
CNG/EAG
IRK

ITP/ryanodine
Neurotransmitter-gated
P2X purinoceptor
TASK

Transient receptor
Voltage-gated Ca2+ alpha
Voltage-gated Ca2+ alpha-2
Voltage-gated Ca2+ beta
Voltage-gated Ca2+ gamma
Voltage-gated K+ alpha
Voltage-gated KQT
Voltage-gated Na+
Myelin basic protein
Myelin P0

Myelin proteolipid
Myelin-oligodendrocyte glycoprotein

regulatory enzymes (137). We enumerated the protein counts of central adaptor and effector enzyme domains that are found only in
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the apoptotic pathways to provide an estimate of divergence across eukarya and relative expansion in the human genome when compared with the fly and worm (Table 18). Adaptor domains found in proteins restricted only to apoptotic regulation such as the DED domains are vertebrate-specific, whereas others like BIR, CARD, and Bcl2 are represented in the fly and worm (although the number of Bcl2 family members in humans is significantly expanded). Although plants and yeast lack the caspases, caspase-like molecules, namely the para- and meta-caspases, have been reported in these organisms (138). Compared with other animal genomes, the human genome shows an expansion in the adaptor and effector domain-containing proteins involved in apoptosis, as well as in the proteases involved in the cascade such as the caspase and calpain families.

Expansions of other protein families. Metabolic enzymes. There are fewer cytochrome P450 genes in humans than in either the fly or worm. Lipoxygenases (six in humans), on the other hand, appear to be specific to the vertebrates and plants, whereas the lipoxygenase-activating proteins (four in humans) may be vertebrate-specific. Lipoxygenases are involved in arachidonic acid metabolism, and they and their activators have been implicated



Neuropilin
Plexin
Semaphorin
Synaptotagmin



Defensin
Cytokine†
GCSF
GMCSF

Intercrine alpha
Intercrine beta
Interferon

Interleukin

Leukemia inhibitory factor
MCSF

Peptidoglycan recognition protein
Pre-B cell enhancing factor
Small inducible cytokine A
SI cytokine

TNF

Cytokine receptor†
Bradykinin/C-C chemokine receptor
H cytokine receptor
Interferon receptor
Interleukin receptor
Leukocyte tyrosine kinase receptor
MCSF receptor
TNF receptor
Immunoglobulin receptor†
T-cell receptor alpha chain
T-cell receptor beta chain
T-cell receptor gamma chain
T-cell receptor delta chain
Immunoglobulin FC receptor
Killer cell receptor
Polymeric-immunoglobulin receptor



1 


0 


17 


12 


11 


24 


22 


9 


ifi 


3 


10 


2 


61- 


51 


10 


0 


12 


12 


15 


3 


22 


4 


10 


3 


5 


2 


1 


0 


33 


5 


6 


2 


11 


4 


1 


0 


5 


0 


3 . 


1 . 


1 


0 


: ■ 2 ' 


0 


9 


2 


22 


6 


10 


3 


Immune response 


3- 


0 


86 


14 


1 


0 


1 


0 


15 


0 


5 


* 0 


8 


0 


26 


1 


1 


0 


1 


0 


2 


13 


1 


0 


14 


0 


2 


0 


9 


b 


62 


1 


7 


0 


2 


0 


3 


. 0 


.32 . 


/ 0 ' 


3 


0 


1 


0 


3 


0 


59 


0 


16 


0 


15 


0 


1 


0 


1 


0 


8 • 


0 


16 


0 


4 


0 



56 


0 


0 


C.7 


0 


0 


9 


0 


30 


3 


0 


0 


4 


0 


0 


59 


0 


19 


0 


0 


0 


48 


1 


5 


3 


1 


0 


8 


2 


✓ 2 


2 


0 


0 


2 


0 


0 


0 


•0 


0 


11 


0 


0 


3 


0 


0 


4 


9 


1 


0 


0 


0 


0 


0 


0 


0 


0 


0 


0 


0 


. 0 


0 


• 0 


0 


0 . 


0 


b 


2 ' 


0 


0 


3 


0 


0 



0 

1 

0 
0 
0 
0 
0 

1 

0 
* 0 
0 
0 
0 

0 . 

0 

0 

0 

0 

0 

0 

0 

0 
0 
0 

p 

0 
0 
0 
0 
0 
0 



0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0* 
. 0 
0 
0 
0 
0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 



0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 
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0 

0 

0 

0 

0 

0 

0 

0 
0 

o 

0 
0 
0 
0 
0 
0 
0 
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alytic activity, as a uracil DNA glycosylase (140) and functions as a cell cycle regulator (141) and has even been implicated in apoptosis (142).

Translation. Another striking set of human expansions has occurred in certain families involved in the translational machinery. We identified 28 different ribosomal subunits that each have at least 10 copies in the genome; on average, for all ribosomal proteins there is about an 8- to 10-fold expansion in the number of genes relative to either the worm or fly. Retrotransposed pseudogenes

Table 19 (Continued)
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may account for many of these expansions [see the discussion above and (143)]. Recent evidence suggests that a number of ribosomal proteins have secondary functions independent of their involvement in protein biosynthesis; for example, L13a and the related L7 subunits (36 copies in humans) have been shown to induce apoptosis (144).

There is also a four- to fivefold expansion in the elongation factor 1-alpha family (eEF1A; 56 human genes). Many of these expansions likely represent intronless paralogs that have presumably arisen from retro-


Panther family/subfamily* 



H 



W 



MHC class I

MHC class II

Other immunoglobulin†

Toll receptor-related



22 
20 
114 
10 



0 
0 
0 
6 



0 
0 
0 
0 



0 
0 
0 
0 



Signaling molecules†
Calcitonin
Ephrin
FGF

Glucagon

Glycoprotein hormone beta chain
Insulin

Insulin-like hormone

Nerve growth factor

Neuregulin/heregulin

Neuropeptide Y

PDGF

Relaxin

Stanniocalcin
Thymopoietin
Thymosin beta
TGF-β
VEGF
Wnt
Receptors†
Ephrin receptor
FGF receptor
Frizzled receptor
Parathyroid hormone receptor
VEGF receptor

BDNF/NT-3 nerve growth factor receptor


Developmental and homeostatic regulators 



3 


0 


0 


0 


0 


8 


2 


4 


0 


0 


24 


1 


1 


0 


0 


4 


0 


0 


0 


0 


2 


0 


0 


0 


0 


1 


0 


0 


0 


0 


3 * 


. 0 


0 


0 


. 0 


3. 


0 


0 


0 


0 


•6 '■■ 


0 . 


0 


• 0 


0 


4 


0 


0 


0 


0 


1 


1 


0 


0 


0 


3 


0 


0 


0 


0 


2 


0 




0 


0 


2 


0 


1 


0 ''' 


0 


4 


2 


0 


0 


0 


29 


6 


• 4 


0 


0 


4 


0 


0 


0 


0 


18 


6 


5 


0 


..p 


12 


2 


1 


0 


0 


4 


. 4 


* 0 


0 


0 


12 


6 


5 


0 


0 


2 


- 0 


0 . 


0 


0 


5 


0 


0 


0 


0 


4 


' 0 


/ 0 ' 


0 


0 





Kinases and phosphatases

Dual-specificity protein phosphatase | 29 | 8 | 10 | 4 | 11
S/T and dual-specificity protein kinase† | 395 | 198 | 315 | 114 | 1102
S/T protein phosphatase | 15 | 19 | 51 | 13 | 29
Y protein kinase† | 106 | 47 | 100 | 5 | 16
Y protein phosphatase | 56 | 22 | 95 | 5 | [illegible]

Signal transduction

ARF family | 55 | 29 | 27 | 12 | 45
Cyclic nucleotide phosphodiesterase | 25 | 8 | 6 | 1 | 0
G protein-coupled receptors†‡ | 616 | 146 | 284 | 0 | 1
G-protein alpha | 27 | 10 | 22 | 2 | 5
G-protein beta | 5 | 3 | 2 | 1 | 1
G-protein gamma | 13 | 2 | 2 | 0 | 0
Ras superfamily | 141 | 64 | 62 | 26 | 86
G-protein modulators†
ARF GTPase-activating | 20 | 8 | 9 | 5 | 15
Neurofibromin | 7 | 2 | 0 | 2 | 0
Ras GTPase-activating | 9 | 3 | 8 | 1 | 0
Tuberin | 7 | 3 | 2 | 0 | 0
Vav proto-oncogene family | 35 | 15 | 13 | 3 | 0




transposition, and again there is evidence that many of these may be pseudogenes (145). However, a second form (eEF1A2) of this factor has been identified with tissue-specific expression in skeletal muscle and a complementary expression pattern to the ubiquitously expressed eEF1A (146).

Ribonucleoproteins. Alternative splicing results in multiple transcripts from a single gene, and can therefore generate additional diversity in an organism's protein complement. We have identified 269 genes for ribonucleoproteins. This represents over 2.5 times the number of ribonucleoprotein genes in the worm, two times that of the fly, and about the same as the 265 identified in the Arabidopsis genome. Whether the diversity of ribonucleoprotein genes in humans contributes to gene regulation at either the splicing or translational level is unknown.

Posttranslational modifications. In this set of processes, the most prominent expansion is the transglutaminases, calcium-dependent enzymes that catalyze the cross-linking of proteins in cellular processes such as hemostasis and apoptosis (147). The vitamin K-dependent gamma carboxylase gene product acts on the GLA domain (missing in the fly and worm) found in coagulation factors, osteocalcin, and matrix GLA protein (148). Tyrosylprotein sulfotransferases participate in the posttranslational modification of proteins involved in inflammation and hemostasis, including coagulation factors and chemokine receptors (149). Although there is no significant numerical increase in the counts for domains involved in nuclear protein modification, there are a number of domain arrangements in the predicted human proteins that are not found in the other currently sequenced genomes. These include the tandem association of two histone deacetylase domains in HD6 with a ubiquitin finger domain, a feature lacking in the fly genome. An additional example is the co-occurrence of the important nuclear regulatory enzyme PARP (poly-ADP ribosyl transferase) domain fused to protein-interaction domains, BRCT and VWA, in humans.

Concluding remarks. There are several possible explanations for the differences in phenotypic complexity observed in humans when compared to the fly and worm. Some of these relate to the prominent differences in the immune system, hemostasis, neuronal, vascular, and cytoskeletal complexity. The finding that the human genome contains fewer genes than previously predicted might be compensated for by combinatorial diversity generated at the levels of protein architecture, transcriptional and translational control, posttranslational modification of proteins, or posttranscriptional regulation. Extensive domain shuffling to increase or alter combinatorial diversity can provide an exponential
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increase in the ability to mediate protein-protein interactions without dramatically increasing the absolute size of the protein complement (150). Evolution of apparently new (from the perspective of sequence analysis) protein domains and increasing regulatory complexity by domain accretion both quantitatively and qualitatively (recruitment of novel domains with preexisting ones) are two features that we observe in humans. Perhaps the best illustration of this trend is the C2H2 zinc finger-containing transcription factors, where we see expansion in the number of domains per protein, together with vertebrate-specific domains such as KRAB and SCAN. Recent reports on the prominent use of internal ribosomal entry sites in the human genome to regulate translation of specific classes of proteins suggest that this is an area that needs further research to identify the full [...] (151). At the posttranslational level, although we provide examples of expansions of some protein families involved in these modifications, further experimental evidence is required to evaluate whether this is correlated with increased complexity in protein processing. Posttranscriptional processing and the extent of isoform generation in the human remain to be cataloged in their entirety. Given the conserved nature of the spliceosomal machinery, further analysis will be required to dissect regulation at this level.

^ Conclusions 
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Table 19 (Continued) 

Panther family/subfamily*



W



C2H2 zinc finger-containing†
CREB

ETS-related
Forkhead-related

FOS
Groucho

Histone H1
Histone H2A
Histone H2B
Histone H3
Histone H4
Homeotic†
ABD-B
Bithoraxoid
Iroquois class
Distal-less
Engrailed
LIM-containing
MEIS/KNOX class
NK-3/NK-2 class
Paired box
Six

Leucine zipper
Nuclear hormone receptor†
Pou-related
Runt-related



Transcription factors/chromatin organization



.1 The whole-genome sequencing approach versus BAC by BAC
Experience in applying the whole-genome shotgun sequencing approach to a diverse group of organisms with a wide range of genome sizes and repeat content allows us to assess its strengths and weaknesses. With the success of the method for a large number of microbial genomes, Drosophila, and now the human, there can be no doubt concerning the utility of this method. The large number of microbial genomes that have been sequenced by this method ([illegible], 80, 152) demonstrates that megabase-sized genomes can be sequenced efficiently without any input other than the de novo mate-paired sequences. With more complex genomes like those of Drosophila or human, map information, in the form of well-ordered markers, has been critical for long-range ordering of scaffolds. For joining scaffolds into chromosomes, the quality of the map (in terms of the order of the markers) is more important than the number of markers per se. Although this mapping could have been performed concurrently with sequencing, the prior existence of mapping data was beneficial. During the sequencing of the A. thaliana genome, sequencing of individual BAC clones permitted extension of the se-



Cadherin

Complement receptor-related
Connexin
Galectin
Glypican
ICAM

Integrin alpha
Integrin beta
LDL receptor family
Proteoglycans

Bcl-2
Calpain

Calpain inhibitor
Caspase

ADAM/ADAMTS
Fibronectin

Globin

Matrix metalloprotease

Serum amyloid A

Serum amyloid P (subfamily of Pentaxin)
Serum paraoxonase/arylesterase
Serum albumin
Transglutaminase



607 
7 
7 
25 
34 

b' 

■ ■ 13 • 
5 

. 24 • 
21 
28 
• 9 
168 
5 
1 
7 
5 
2 
17 
9 



• 232 
1 
1 
8 
19 
2 

:\2 

0 

1 
1 

2 
1 

104 
0 
8 
3 
2 
2 
8 
4 
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4 


38 


28 
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59 


25 


15 


5 


3 


4 


ECM adhesion




113 


17 


20 


0 


. 22 


8 


14 


0 


12 


5 


13 


2 


6 


0 


24 


7 


9 


2 


26 


19 


22 


9 



Apoptosis 
12 
22 
4 
13 

Hemostasis 
51 
3 
10 
19 
4 
.2 



1. 
4 
0 
7 

9 
0 
2 
2 
0 
0 



Cytochrome P450
GAPDH

Heparan sulfotransferase

EF-1alpha
Ribonucleoproteins†
Ribosomal proteins‡



4 0
4 0

10 1

Other enzymes

60 89
46 3

11 4
Splicing and translation

56 13
269 135
812 in



79 
1 

2 
10 
15 
• - .1 

■ ■1 ; 
1 

17 
17 
24 
16 
74 
0 
1 
1 
1 
1 
3 
4 
5 
23 
4 
0 
183 
4 
2 

16 
0 
6 

0 - 
22 

1 - ' 
0 

4 
2 
20 
7 



28 
O 
O 
0 

' . 4 

O - 
>. o ' - 

. 0 

3 

2 

2 

1 
4 
O 
0 
O 
0 



0 
O 
2 
0 

o. 

0 
0 

1 
1 

0 



0 
0 
0 
0 
0 
0 

• o 

0 

o 

0 
0 



8 
0 

o 

0 

b 
o 

0 
0 
13 
12 
16 
8 
78 
0 
O 
0 
0 
0 
0 
26 
0 
2 
0 
0 
4 
0 
0 
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0 
0 
0 
0 
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0 
2 
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11 
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' 12 


0 
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0 
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. p 


0 
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0 
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0 


• 0 


. 0 


0 


0 


83 


3 


256 


4 


3 


8 
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0 


'10 


6 


13 


104 


60 


265 


80 


117 


256 



(TAU 18) cr (10 differ In counts from the co" «^^rprrn mod^^^^ V«cIficaUy r.pr«ent«d by Pfam 

hmlU« In the Mm* Panther mol«uUr fcnrf^'^C«o^ tnis .J^^ 1*!f «P««ntJ a number of different 
diss, and metabotropIegtuUmatedaa CPCto HiodopjIn-cUis. jecretln-

quence well into centromeric regions and allowed high-quality
resolution of complex repeat regions. Likewise, in Drosophila, the BAC
physical map was most useful in regions near the highly repetitive
centromeres and telomeres. WGA has been found to deliver
excellent-quality reconstructions of the unique regions of the genome.
As the genome size, and more importantly the repetitive content,
increases, the WGA approach delivers less of the repetitive sequence.

The cost and overall efficiency of clone-by-clone approaches makes
them difficult to justify as a stand-alone strategy for future
large-scale genome-sequencing projects. Specific applications of
BAC-based or other clone mapping and sequencing strategies to resolve
ambiguities in sequence assembly that cannot be efficiently resolved
with computational approaches alone are clearly worth exploring.
Hybrid approaches to whole-genome sequencing will only work if there
is sufficient coverage in both the whole-genome shotgun phase and the
BAC clone sequencing phase. Our experience with human genome assembly
suggests that this will require at least 3× coverage of both
whole-genome and BAC shotgun sequence data.

8.2 The low gene number in humans
We have sequenced and assembled ~95% of the euchromatic sequence of
H. sapiens and used a new automated gene prediction method to produce
a preliminary catalog of the human genes. This has provided a major
surprise: We have found far fewer genes (26,000 to 38,000) than the
earlier molecular predictions (50,000 to over 140,000). Whatever the
reasons for this current disparity, only detailed annotation,
comparative genomics (particularly using the Mus musculus genome), and
careful molecular dissection of complex phenotypes will clarify this
critical issue of the basic "parts list" of our genome. Certainly, the
analysis is still incomplete and considerable refinement will occur in
the years to come as the precise structure of each transcription unit
is evaluated. A good place to start is to determine why the gene
estimates derived from EST data are so discordant with our
predictions. It is likely that the following contribute to an inflated
gene number derived from ESTs: the variable lengths of 3'- and
5'-untranslated leaders and trailers; the little-understood vagaries
of RNA processing that often leave intronic regions in an unspliced
condition; the finding that nearly 40% of human genes are
alternatively spliced (153); and finally, the unsolved technical
problems in EST library construction where contamination from
heterogeneous nuclear RNA and genomic DNA are not uncommon. Of course,
it is possible that there are genes that remain unpredicted owing to
the absence of EST or protein data to support them, although our use
of mouse genome data for predicting genes should limit this number. As
was true at the beginning of genome sequencing, ultimately it will be
necessary to measure mRNA in specific cell types to demonstrate the
presence of a gene.

J. B. S. Haldane speculated in 1937 that a population of organisms
might have to pay a price for the number of genes it can possibly
carry. He theorized that when the number of genes becomes too large,
each zygote carries so many new deleterious mutations that the
population simply cannot maintain itself. On the basis of this
premise, and on the basis of available mutation rates and
x-ray-induced mutations at specific loci, Muller, in 1967 (154),
calculated that the mammalian genome would contain a maximum of not
much more than 30,000 genes (155). An estimate of 30,000 gene loci for
humans was also arrived at by Crow and Kimura (156). Muller's estimate
for D. melanogaster was 10,000 genes, compared to 13,000 derived by
annotation of the fly genome (26, 27). These arguments for the
theoretical maximum gene number were based on simplified ideas of
genetic load: that all genes have a certain low rate of mutation to a
deleterious state. However, it is clear that many mouse, fly, worm,
and yeast knockout mutations lead to almost no discernible phenotypic
perturbations.

The modest number of human genes means that we must look elsewhere for
the mechanisms that generate the complexities inherent in human
development and the sophisticated signaling systems that maintain
homeostasis. There are a large number of ways in which the functions
of individual genes and gene products are regulated. The degree of
"openness" of chromatin structure and hence transcriptional activity
is regulated by protein complexes that involve histone and DNA
enzymatic modifications. We enumerate many of the proteins that are
likely involved in nuclear regulation in Table 19. The location,
timing, and quantity of transcription are intimately linked to nuclear
signal transduction events as well as by the tissue-specific
expression of many of these proteins. Equally important are regulatory
DNA elements that include insulators, repeats, and endogenous viruses
(157); methylation of CpG islands in imprinting (158); and
promoter-enhancer and intronic regions that modulate transcription.
The spliceosomal machinery consists of multisubunit proteins (Table
19) as well as structural and catalytic RNA elements (159) that
regulate transcript structure through alternative start and
termination sites and splicing. Hence, there is a need to study
different classes of RNA molecules (160) such as small nucleolar RNAs,
antisense riboregulator RNA, RNA involved in X-dosage compensation,
and other structural RNAs to appreciate their precise role in
regulating gene expression. The phenomenon of RNA editing in which
coding changes occur directly at the level of mRNA is of clinical and
biological relevance (161). Finally, examples of translational control
include internal ribosomal entry sites that are found in proteins
involved in cell cycle regulation and apoptosis (162). At the protein
level, minor alterations in the nature of protein-protein
interactions, protein modifications, and localization can have
dramatic effects on cellular physiology (163). This dynamic system
therefore has many ways to modulate activity, which suggests that
definition of complex systems by analysis of single genes is unlikely
to be entirely successful.

In situ studies have shown that the human genome is asymmetrically
populated with G+C content, CpG islands, and genes (68). However, the
genes are not distributed quite as unequally as had been predicted
(Table 9) (69). The most G+C-rich fraction of the genome, H3
isochores, constitute more of the genome than previously thought
(about 9%), and are the most gene-dense fraction, but contain only 25%
of the genes, rather than the predicted ~40%. The low G+C L isochores
make up 65% of the genome, and 48% of the genes. This inhomogeneity,
the net result of millions of years of mammalian gene duplication, has
been described as the "desertification" of the vertebrate genome (77).
Why are there clustered regions of high and low gene density, and are
these accidents of history or driven by selection and evolution? If
these deserts are dispensable, it ought to be possible to find
mammalian genomes that are far smaller in size than the human genome.
Indeed, many species of bats have genome sizes that are much smaller
than that of humans; for example, Miniopterus, a species of Italian
bat, has a genome size that is only 50% that of humans (164).
Similarly, Muntiacus, a species of Asian barking deer, has a genome
size that is ~70% that of humans.
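
As a rough check on the isochore figures quoted above, the share of
genes in each fraction can be divided by its share of the genome to
give a gene density relative to the genome average. The sketch below is
illustrative only: the dictionary layout and the function name
relative_gene_density are assumptions introduced here, and the only
inputs are the percentages stated in the text.

```python
# Percentages quoted in the text for two isochore classes.
ISOCHORES = {
    "H3": {"genome_fraction": 0.09, "gene_fraction": 0.25},
    "L": {"genome_fraction": 0.65, "gene_fraction": 0.48},
}


def relative_gene_density(genome_fraction: float, gene_fraction: float) -> float:
    """Gene density relative to the genome-wide average (average = 1.0)."""
    return gene_fraction / genome_fraction


# Hypothetical helper table: one relative density per isochore class.
DENSITIES = {name: relative_gene_density(**v) for name, v in ISOCHORES.items()}

if __name__ == "__main__":
    for name, density in DENSITIES.items():
        print(f"{name}: {density:.2f} x genome average")
```

On these numbers the H3 isochores carry roughly 2.8 times the average
gene density and the L isochores roughly 0.74 times. A prediction of
40% of genes in H3 would instead have implied about 4.4 times the
average (0.40/0.09), which is the sense in which the genes are less
unequally distributed than expected.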



8.3 Human DNA sequence variation and its distribution across the genome
This is the first eukaryotic genome in which a nearly uniform
ascertainment of polymorphism has been completed. Although we have
identified and mapped more than 3 million SNPs, this by no means
implies that the task of finding and cataloging SNPs is complete.
These represent only a fraction of the SNPs present in the human
population as a whole. Nevertheless, this first glimpse at genome-wide
variation has revealed strong inhomogeneities in the distribution of
SNPs across the genome. Polymorphism in DNA carries with it a snapshot
of the past operation of population genetic forces, including
mutation, migration, selection, and genetic drift. The availability of
a dense array of SNPs will allow questions related to each of these
factors to be addressed on a genome-wide basis. SNP studies can
establish the range of haplotypes present in subjects of different
ethnogeographic origins, providing insights into population history
and migration patterns. Although such studies have suggested that
modern human lineages derive from Africa, many important questions
regarding human origins remain unanswered, and more analyses using
detailed SNP maps will be needed to settle these controversies. In
addition to providing evidence for population expansions, migration,
and admixture, SNPs can serve as markers for the extent of
evolutionary constraint acting on particular genes. The correlation
between patterns of intraspecies and interspecies genetic variation
may prove to be especially informative to identify sites of reduced
genetic diversity that may mark loci where sequence variations are not
tolerated.

1346    16 FEBRUARY 2001 VOL 291 SCIENCE www.sciencemag.org

The remarkable heterogeneity in SNP density implies that there are a
variety of forces acting on polymorphism: sparse regions may have
lower SNP density because the mutation rate is lower, because most of
those regions have a lower fraction of mutations that are tolerated,
or because recent strong selection in favor of a newly arisen allele
"swept" the linked variation out of the population (165). The effect
of random genetic drift also varies widely across the genome. The
nonrecombining portion of the Y chromosome faces the strongest
pressure from random drift because there are roughly one-quarter as
many Y chromosomes in the population as there are autosomal
chromosomes, and the level of polymorphism on the Y is correspondingly
less. Similarly, the X chromosome has a smaller effective population
size than the autosomes, and its nucleotide diversity is also reduced.
But even across a single autosome, the effective population size can
vary because the density of deleterious mutations may vary. Regions of
high density of deleterious mutations will see a greater rate of
elimination by selection, and the effective population size will be
smaller. As a result, the density of even completely neutral SNPs will
be lower in such regions. There is a large literature on the
association between SNP density and local recombination rates in
Drosophila, and it remains an important task to assess the strength of
this association in the human genome, because of its impact on the
design of local SNP densities for disease-association studies. It also
remains an important task to validate SNPs on a genomic scale in order
to assess the degree of heterogeneity among geographic and ethnic
populations.

8.4 Genome complexity
We will soon be in a position to move away from the cataloging of
individual components of the system, and beyond the simplistic notions
of "this binds to that, which then docks on this, and then the complex
moves there. . ." (167) to the exciting area of network perturbations,
nonlinear responses and thresholds, and their pivotal role in human
diseases.

The enumeration of other "parts lists" reveals that in organisms with
complex nervous systems, neither gene number, neuron number, nor
number of cell types correlates in any meaningful manner with even
simplistic measures of structural or behavioral complexity. Nor would
they be expected to; this is the realm of nonlinearities and
epigenesis (168). The 520 million neurons of the common octopus exceed
the neuronal number in the brain of a mouse by an order of magnitude.
It is apparent from a comparison of genomic data on the mouse and
human, and from comparative mammalian neuroanatomy (169), that the
morphological and behavioral diversity found in mammals is underpinned
by a similar gene repertoire and similar neuroanatomies. For example,
when one compares a pygmy marmoset (which is only 4 inches tall and
weighs about 6 ounces) to a chimpanzee, the brain volume of this
minute primate is found to be only about 1.5 cm³, two orders of
magnitude less than that of a chimp and three orders less than that of
humans. Yet the neuroanatomies of all three brains are strikingly
similar, and the behavioral characteristics of the pygmy marmoset are
little different from those of chimpanzees. Between humans and
chimpanzees, the gene number, gene structures and functions,
chromosomal and genomic organizations, and cell types and
neuroanatomies are almost indistinguishable, yet the developmental
modifications that predisposed human lineages to cortical expansion
and development of the larynx, giving rise to language, culminated in
a massive singularity that by even the simplest of criteria made
humans more complex in a behavioral sense.

Simple examination of the number of neurons, cell types, or genes or
of the genome size does not alone account for the differences in
complexity that we observe. Rather, it is the interactions within and
among these sets that result in such great variation. In addition, it
is possible that there are "special cases" of regulatory gene networks
that have a disproportionate effect on the overall system. We have
presented several examples of "regulatory genes" that are
significantly increased in the human genome compared with the fly and
worm. These include extracellular ligands and their cognate receptors
(e.g., wnt, frizzled, TGF-β, ephrins, and connexins), as well as
nuclear regulators (e.g., the KRAB and homeodomain transcription
factor families), where a few proteins control broad developmental
processes. The answers to these "complexities" perhaps lie in these
expanded gene families and differences in the regulatory control of
ancient genes, proteins, pathways, and cells.

8.5 Beyond single components
While few would disagree with the intuitive conclusion that Einstein's
brain was more complex than that of Drosophila, closer comparisons
such as whether the set of predicted human proteins is more complex
than the protein set of Drosophila, and if so, to what degree, are not
straightforward, since protein, protein domain, or protein-protein
interaction [...] interactions that underpin the dynamics underlying
phenotype.

Currently, there are more than 30 different mathematical descriptions
of complexity (170). However, we have yet to understand the
mathematical dependency relating the number of genes with organism
complexity. One pragmatic approach to the analysis of biological
systems, which are composed of nonidentical elements (proteins,
protein complexes, interacting cell types, and interacting neuronal
populations), is through graph theory (171). The elements of the
system can be represented by the vertices of complex topographies,
with the edges representing the interactions between them. Examination
of large networks reveals that they can self-organize, but more
important, they can be particularly robust. This robustness is not due
to redundancy, but is a property of inhomogeneously wired networks.
The error tolerance of such networks comes with a price; they are
vulnerable to the selection or removal of a few nodes that contribute
disproportionately to network stability. Gene knockouts provide an
illustration. Some knockouts may have minor effects, whereas others
have catastrophic effects on the system. In the case of vimentin, a
supposedly critical component of the cytoplasmic intermediate filament
network of mammals, the knockout of the gene in mice reveals them to
be reproductively normal, with no obvious phenotypic effects (172),
and yet the usually conspicuous vimentin network is completely absent.
On the other hand, ~30% of knockouts in Drosophila and mice correspond
to critical nodes whose reduction in gene product, or total
elimination, causes the network to crash most of the time, although
even in some of these cases, phenotypic normalcy ensues, given the
appropriate genetic background. Thus, there are no "good" genes or
"bad" genes, but only networks that exist at various levels and at
different connectivities, and at different states of sensitivity to
perturbation. Sophisticated mathematical analysis needs to be
constantly evaluated against hard biological data sets that
specifically address network dynamics. Nowhere is this more critical
than in attempts to come to grips with "complexity," particularly
because deconvoluting and correcting complex networks that have
undergone perturbation, and have resulted in human diseases, is the
greatest significant challenge now facing us.
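
The contrast drawn above, between tolerance of most node removals and
vulnerability to the loss of a few highly connected nodes, can be
illustrated with a toy graph. The following is a minimal sketch under
stated assumptions, not a model of any real gene network: the
hub-and-spoke topology, the node names, and the functions hub_network
and largest_component are all hypothetical, and the size of the
largest connected component is used here as a simple stand-in for
"network stability."

```python
from collections import deque


def hub_network(hubs: int = 3, spokes: int = 10) -> dict:
    """Build an inhomogeneously wired toy graph: a ring of hubs,
    each hub also linked to its own set of low-degree spoke nodes."""
    adj: dict = {}

    def link(a: str, b: str) -> None:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)

    for h in range(hubs):
        link(f"hub{h}", f"hub{(h + 1) % hubs}")
        for s in range(spokes):
            link(f"hub{h}", f"n{h}_{s}")
    return adj


def largest_component(adj: dict, removed: set) -> int:
    """Size of the largest connected component after deleting `removed`."""
    seen = set(removed)          # removed nodes are never visited
    best = 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:             # breadth-first search from `start`
            node = queue.popleft()
            size += 1
            for neighbor in adj[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        best = max(best, size)
    return best


NETWORK = hub_network()

if __name__ == "__main__":
    print("intact:", largest_component(NETWORK, set()))
    print("one spoke knocked out:", largest_component(NETWORK, {"n0_0"}))
    print("one hub knocked out:", largest_component(NETWORK, {"hub0"}))
```

In this toy graph, deleting a spoke node costs only that node, whereas
deleting a single hub disconnects every spoke attached to it. The
analogy to knockouts with minor versus catastrophic effects is loose
and is offered only as an illustration of the graph-theoretic point.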

It has been predicted for the last 15 years that complete sequencing
of the human genome would open up new strategies for human biological
research and would have a major impact on medicine, and through
medicine and public health, on society. Effects on biomedical research
are already being felt. This assembly of the human genome sequence is
but a first, hesitant step on a long and exciting journey toward
understanding the role of the genome in human biology. It has been
possible only because of innovations in instrumentation and software
that have allowed automation of almost every step of the process from
DNA preparation to annotation. The next steps are clear: We must
define the complexity that ensues when this relatively modest set of
about 30,000 genes is expressed. The sequence provides the framework
upon which all the genetics, biochemistry, physiology, and ultimately
phenotype depend. It provides the boundaries for scientific inquiry.
The sequence is only the first level of understanding of the genome.
All genes and their control elements must be identified; their
functions, in concert as well as in isolation, defined; their sequence
variation worldwide described; and the relation between genome
variation and specific phenotypic characteristics determined. Now we
know what we have to explain.

Another paramount challenge awaits: public discussion of this
information and its potential for improvement of personal health. Many
diverse sources of data have shown that any two individuals are more
than 99.9% identical in sequence, which means that all the glorious
differences among individuals in our species that can be attributed to
genes falls in a mere 0.1% of the sequence. There are two fallacies to
be avoided: determinism, the idea that all characteristics of the
person are "hard-wired" by the genome; and reductionism, the view that
with complete knowledge of the human genome sequence, it is only a
matter of time before our understanding of gene functions and
interactions will provide a complete causal description of human
variability. The real challenge of human biology, beyond the task of
finding out how genes orchestrate the construction and maintenance of
the miraculous mechanism of our bodies, will lie ahead as we seek to
explain how our minds have come to organize thoughts sufficiently well
to investigate our own existence.

suspension (1 ml) containing 0.1 M NaCl, 10 mM
tris, 20 mM EDTA (pH 8), 1% SDS, 1 proteinase K, and 10 mM
dithiothreitol for 1 hour at 37°C. The lysate was extracted with
aqueous phenol and with phenol/chloroform. The DNA was ethanol
precipitated and dissolved in 1 ml TE buffer. To make genomic
libraries, DNA was randomly sheared, end-polished with consecutive
BAL31 nuclease and T4 DNA polymerase treatments, and size-selected by
electrophoresis on 1% low-melting-point agarose. After ligation to
Bst XI adapters (Invitrogen, catalog no. N408-18), DNA was purified by
three rounds of gel electrophoresis to remove excess adapters and
inserted into Bst XI-linearized plasmid vector with 3'-TGTG overhangs.
Libraries with three different insert sizes were constructed: 2, 10,
and 50 kbp. The 2-kbp fragments were cloned in a high-copy pUC18
derivative. The 10- and 50-kbp fragments were cloned in a medium-copy
pBR322 derivative. The 2- and 10-kbp libraries yielded uniform-sized
large colonies on plating. However, the 50-kbp libraries produced many
small colonies and inserts were unstable. To remedy this, the 50-kbp
libraries were digested with Bgl II, which does not cleave the vector,
but generally cleaved several times within the 50-kbp insert. A
1264-bp Bam HI kanamycin resistance cassette (purified from pUCK4;
Amersham Pharmacia, catalog no. 27-4958-01) was added and ligation was
carried out at 37°C in the continual presence of Bgl II. As
Bgl II-Bgl II ligations occurred, they were continually cleaved,
whereas Bam HI-Bgl II ligations were not cleaved. A high yield of
internally deleted circular library molecules was obtained in which
the residual insert ends were separated by the kanamycin cassette DNA.
The internally deleted libraries, when plated on agar containing
ampicillin (50 μg/ml), carbenicillin (50 μg/ml), and kanamycin
(15 μg/ml), produced relatively uniform large colonies. The resulting
clones could be prepared for sequencing using the same procedures as
clones from the 10-kbp libraries.
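
The selective step in this note relies on Bgl II and Bam HI leaving
the same 5'-GATC overhang while recognizing different six-base sites.
The sketch below illustrates why a Bgl II/Bam HI junction survives
continued Bgl II digestion. It assumes the standard recognition sites
(Bgl II, A^GATCT; Bam HI, G^GATCC), considers only the top strand, and
uses hypothetical helper names (junction, recleavable); it is not part
of the published protocol.

```python
# Recognition sites (top strand, 5'->3'); both enzymes cut after the
# first base, leaving a 5'-GATC overhang.
SITES = {"BglII": "AGATCT", "BamHI": "GGATCC"}


def junction(left_enzyme: str, right_enzyme: str) -> str:
    """Six-base sequence formed when an end produced by `left_enzyme`
    is ligated to an end produced by `right_enzyme`."""
    left, right = SITES[left_enzyme], SITES[right_enzyme]
    # The left fragment keeps the first base of its site; the right
    # fragment contributes the overhang plus the last base of its site.
    return left[:1] + right[1:]


def recleavable(site: str, enzyme: str = "BglII") -> bool:
    """True if `enzyme` recognizes the ligated junction."""
    return site == SITES[enzyme]


if __name__ == "__main__":
    for pair in [("BglII", "BglII"), ("BglII", "BamHI"), ("BamHI", "BglII")]:
        seq = junction(*pair)
        print(pair, seq, "recut by Bgl II:", recleavable(seq))
```

Insert-to-insert (Bgl II/Bgl II) junctions regenerate AGATCT and are
cut again, while cassette-to-insert junctions (AGATCC or GGATCT) match
neither site, which is consistent with the enrichment for internally
deleted molecules described in the note.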

34. Transformed cells were plated on agar diffusion plates prepared
with a fresh top layer containing no antibiotic poured on top of a
previously set bottom layer containing excess antibiotic, to achieve
the correct final concentration. This method of plating permitted the
cells to develop antibiotic resistance before being exposed to
antibiotic without the potential clone bias that can be introduced
through liquid outgrowth protocols. After colonies had grown, QBot
(Genetix, UK) automated colony-picking robots were used to pick
colonies meeting stringent size and shape criteria and to inoculate
384-well microtiter plates containing liquid growth medium. Cultures
were incubated overnight, with shaking, and were scored for growth
before passing to template preparation. Template DNA was extracted
from liquid bacterial culture using a procedure based on the alkaline
lysis miniprep method (173) adapted for high throughput processing in
384-well microtiter plates. Bacterial cells were lysed; cell debris
was removed by centrifugation; and plasmid DNA was recovered by
isopropanol precipitation and resuspended in 10 mM tris-HCl buffer.
Reagent dispensing operations were accomplished using Titertek MAP 8
liquid dispensing systems. Plate-to-plate liquid transfers were
performed using Tomtec Quadra 384 Model 320 pipetting robots. All
plates were tracked throughout processing by unique plate barcodes.
Mated sequencing reads from opposite ends of each clone insert were
obtained by preparing two 384-well cycle sequencing reaction plates
from each plate of plasmid template DNA using ABI PRISM BigDye
Terminator chemistry (Applied Biosystems) and standard M13 forward and
reverse primers. Sequencing reactions were prepared using the Tomtec
Quadra 384-320 pipetting robot. Parent-child plate relationships and,
by extension, forward-reverse sequence mate pairs were established by
automated plate barcode reading by the onboard barcode reader and were
recorded by direct LIMS communication. Sequencing reaction products
were purified by alcohol precipitation and were dried, sealed, and
stored at 4°C in the dark until needed for sequencing, at which time
the reaction products were resuspended in deionized formamide and
sealed immediately to prevent degradation. All sequence data were
generated using a single sequencing platform, the ABI PRISM 3700 DNA
Analyzer. Sample sheets were created at load time using a Java-based
application that facilitates barcode scanning of the sequencing plate
barcode, retrieves sample information from the central LIMS, and
reserves unique trace identifiers. The application permitted a single
sample sheet file in the linking directory and deleted previously
created sample sheet files immediately upon scanning of a

89. Lek first compares all proteins in the proteome to
one another. Next, the resulting BLAST reports are parsed, and a graph
is created wherein each protein constitutes a node; any hit between
two proteins satisfying a user-specified threshold constitutes an
edge. Lek then uses this graph to compute a similarity between each
pair of proteins, i and j, by simply dividing the number of BLAST hits
shared in common between the two proteins by the total number of
proteins hit by i and j. This simple metric has several interesting
properties. First, because the metric takes into account both the
similarity and the differences between sequences at the level of BLAST
hits, the metric respects the multidomain nature of protein space. Two
multidomain proteins, for instance, each containing domains A and B,
will have a greater pairwise similarity to each other than either one
will have to a protein containing only A or B domains, so long as
A-B-containing multidomain proteins are less frequent in the proteome
than are single-domain proteins containing A or B domains. A second
interesting property of this similarity metric is that it can be used
to produce a similarity matrix for the proteome as a whole without
having to first produce a multiple alignment for each protein family,
an error-prone and very time-consuming process. Finally, the metric
does not require that either sequence have significant homology to the
other in order to have a defined similarity to each other, only that
they share at least one significant BLAST hit in common. This is an
especially interesting property of the metric, because it allows the
rapid recovery of protein families from the proteome for which no
multiple alignment is possible, thus providing a computational basis
for the extension of protein homology searches beyond those of current
HMM- and profile-based search methods. Once the whole-proteome
similarity matrix has been calculated, Lek first partitions the
proteome into single-linkage clusters (27) on the basis of one or more
shared BLAST hits between two sequences. Next, these single-linkage
clusters are further partitioned into subclusters, each member of
which shares a user-specified pairwise similarity with the other
members of the cluster, as described above. For the purposes of this
publication we have focused on the analysis of single-linkage clusters
and what we have termed "complete clusters," e.g., those subclusters
for which every member has a similarity metric of 1 to every other
member of the subcluster. We believe that the single-linkage and
complete clusters are of special interest, in part, because they allow
us to estimate and to compare sizes of core protein sets in a rigorous
manner. The rationale for this is as follows: If one imagines for a
moment a perfect clustering algorithm capable of perfectly
partitioning one or more perfectly annotated protein sets into protein
families, it is reasonable to assume that the number of clusters will
always be greater than, or equal to, the number of single-linkage
clusters, because single-linkage clustering is a maximally
agglomerative clustering method. Thus, if there exists a single
protein in the predicted protein set containing domains A and B, then
it will be clustered by single linkage together with all single-domain
proteins containing domains A or B. Likewise, for a predicted protein
set containing a single multidomain protein, the number of real
clusters must always be less than or equal to the number of complete
clusters, because it is impossible to place a unique multidomain
protein into a complete cluster. Thus, the single-linkage and complete
clusters plus singletons should comprise a lower and upper bound of
sizes of core protein sets, respectively, allowing us to compare the
relative size and complexity of different organisms' predicted protein
set.
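
The similarity metric and the single-linkage step described in this
note can be sketched on a toy data set. This is not the Lek
implementation: it assumes that "the total number of proteins hit by i
and j" means the union of the two hit lists (so the metric is a
Jaccard index), that each protein's hit list includes itself, and that
the protein names, the HITS table, and the function names are
hypothetical.

```python
from itertools import combinations

# Toy all-against-all hit lists: AB1 and AB2 contain domains A and B,
# A1 contains only A, B1 only B, and C1 is unrelated to the rest.
HITS = {
    "AB1": {"AB1", "AB2", "A1", "B1"},
    "AB2": {"AB1", "AB2", "A1", "B1"},
    "A1": {"A1", "AB1", "AB2"},
    "B1": {"B1", "AB1", "AB2"},
    "C1": {"C1"},
}


def lek_similarity(hits_i: set, hits_j: set) -> float:
    """Shared hits divided by all proteins hit by either protein."""
    union = hits_i | hits_j
    return len(hits_i & hits_j) / len(union) if union else 0.0


def single_linkage(hits: dict) -> list:
    """Group proteins that are connected by at least one shared hit."""
    parent = {p: p for p in hits}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    for a, b in combinations(hits, 2):
        if hits[a] & hits[b]:               # one shared hit is enough
            parent[find(a)] = find(b)

    clusters: dict = {}
    for p in hits:
        clusters.setdefault(find(p), set()).add(p)
    return sorted(sorted(c) for c in clusters.values())


if __name__ == "__main__":
    print(lek_similarity(HITS["AB1"], HITS["AB2"]))
    print(lek_similarity(HITS["AB1"], HITS["A1"]))
    print(single_linkage(HITS))
```

In the toy data the two A-B proteins have similarity 1 to each other
and a lower similarity to the single-domain proteins, yet single
linkage still merges all four into one cluster. This mirrors the
argument above that single-linkage clusters give a lower bound on the
number of protein families.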
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Is 1AV. where Is the number of proteins In the set 
(for this analysis.'^ « 26,588). Allowing foV B' to 
^ occur as any of the next J-l proteins (leavings gap 
between A' and B' Incre^'es the probaSiUty to (y 
1)//ft allowing B'A' or A'B' gives a probabiUty of 2(y 

- 1)//yi. Considering thre? genes ABC the probabil- 
ity of observing A'S'^'' elsewhere In the ^nome, 
gh^en that the paralogs exist Is VN » Three pro- 
teins can occiir across a spread of five positions In 
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probability of observing a duplicated set of three
genes in two different locations, where the three
genes occur across a spread of five positions in both
locations, is 36/N^2; the expected number of such
matched sets in the predicted protein set is approximately
(N)36/N^2 = 36/N, a value <<1. Therefore,
any such duplications of three genes are unlikely to
result from random rearrangements of the genome. If
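
The arithmetic in this estimate can be checked with a short sketch. It takes N = 26,588 and the factor 36 directly from the passage; the function name and the use of exact fractions are illustrative choices, not part of the original analysis.

```python
from fractions import Fraction

N = 26_588         # proteins in the predicted set, as stated in the passage
ARRANGEMENTS = 36  # combined factor for both locations, as quoted in the passage


def expected_matched_sets(n_proteins, arrangements=ARRANGEMENTS):
    """Expected number of chance three-gene matches: N * (36 / N^2) = 36 / N."""
    p_match = Fraction(arrangements, n_proteins ** 2)
    return n_proteins * p_match


expected = expected_matched_sets(N)
print(float(expected))  # on the order of 10^-3, far below 1
```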
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EXHIBIT K

THE HUMAN
GENOME

A historic moment for scientific endeavor.

Humanity has been given a great gift. With the completion of the human
genome sequence, we have received a powerful tool for unlocking the
secrets of our genetic heritage and for finding our place among the other
participants in the adventure of life.

This week's issue of Science contains the report of the sequencing of
the human genome from a group of authors led by Craig Venter of Celera
Genomics. The report of the sequencing of the human genome from the
publicly funded consortium of laboratories led by Francis Collins appears
in this week's Nature. This stunning achievement has been portrayed,
often unfairly, as a competition between two
ventures, one public and one private. That characterization detracts from
the awesome accomplishment jointly unveiled this week. In truth, each
project contributed to the other. The inspired vision that launched the
publicly funded project roughly 10 years ago reflected and rewards
the confidence of those who believe that the pursuit of large-scale fundamental
problems in the life sciences is in the national interest. The technical
innovation and drive of Craig Venter and his colleagues made it possible
to celebrate this accomplishment far sooner than was believed possible.
Thus, we can salute what has become, in the end, not a contest but a
marriage (perhaps encouraged by shotgun) between public funding and
private entrepreneurship.

There are excellent scientific reasons for applauding an outcome that
has [...] winners. Two sequences are better than one; the opportunity for
[...]. Indeed, a [...]
be found in the pages of this issue of Science, in the comparative analysis by Olivier et al. (p. 1298).

Though we have made the point before, it is worth repeating that the sequencing of the human
genome represents not an ending, but the beginning of a new approach to biology. As Galas says
[...] knowledge that all of the genetic components of any process can be
[...] to scientists. Because of this breakthrough, research
[...] the effects of individual genes to a more integrated view that
[...] ensembles of genes as they interact to form a living human being. Several articles in this issue
[...] is already beginning to revolutionize the way we look at human disease.
This has been a massive project, on a scale unparalleled in the history of biology, but of course
it has [...] of centuries [...]

[...] free of charge [...] other proprietary data to be published after peer review, in a
[...] help define us and our place in the great tapestry of life.
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MEGABLAST 1.2.3-Paracel [2001-11-20]
Reference:

Zheng Zhang, Scott Schwartz, Lukas Wagner, and Webb Miller (2000),
"A greedy algorithm for aligning DNA sequences",
J Comput Biol 2000; 7(1-2):203-14.
Database: Homo_sapiens.latestgp.fa

26,614 sequences; 200,770,258,131 total letters 

Query= LEX 2 93 SEQ ID NO 1 
(1404 letters) 

                                                     Score     E
Sequences producing significant alignments:          (bits)  Value

AC011328.11.1.216031                                   339   3e-90

>AC011328.11.1.216031

Length = 216031

Score = 339 bits (171), Expect = 3e-90
Identities = 171/171 (100%)
Strand = Plus / Plus

Query: 1170   cagtggaaaacttgagccaggcatgacttacacaaaattaatcgatgcagatgttaacgt 1229
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 166881 cagtggaaaacttgagccaggcatgacttacacaaaattaatcgatgcagatgttaacgt 166940

Query: 1230   tggaaacattacaagtgttcagttcatctggaaaaaacatttgtttgaagattctcagaa 1289
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 166941 tggaaacattacaagtgttcagttcatctggaaaaaacatttgtttgaagattctcagaa 167000

Query: 1290   taagttgggagcagaaatggtgataaatacatctgggaaatatggatataa 1340
              |||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 167001 taagttgggagcagaaatggtgataaatacatctgggaaatatggatataa 167051



Score = 313 bits (158), Expect = 2e-82
Identities = 158/158 (100%)
Strand = Plus / Plus

Query: 48     aggaaaagaagtttgctatgaaaggttagggtgtttcaaagatggtttaccatggaccag 107
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 126941 aggaaaagaagtttgctatgaaaggttagggtgtttcaaagatggtttaccatggaccag 127000

Query: 108    gactttctcaacagagttggtaggtttaccctggtctccagagaagataaacactcgttt 167
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 127001 gactttctcaacagagttggtaggtttaccctggtctccagagaagataaacactcgttt 127060



http://lexblast.lexgen.coni/blast results.cgi?id=10199&refresh=60 
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Query: 168    cctgctctacactatacacaatcccaatgcctatcagg 205
              ||||||||||||||||||||||||||||||||||||||
Sbjct: 127061 cctgctctacactatacacaatcccaatgcctatcagg 127098



Score = 268 bits (135), Expect = 9e-69
Identities = 135/135 (100%)
Strand = Plus / Plus

Query: 928    ggaaattgcttcttttgttccaaagaaggttgcccaacaatgggtcattttgctgataga 987
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 159417 ggaaattgcttcttttgttccaaagaaggttgcccaacaatgggtcattttgctgataga 159476

Query: 988    tttcacttcaaaaatatgaagactaatggatcacattattttttaaacacagggtccctt 1047
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 159477 tttcacttcaaaaatatgaagactaatggatcacattattttttaaacacagggtccctt 159536

Query: 1048   tccccatttgcccgt 1062
              |||||||||||||||
Sbjct: 159537 tccccatttgcccgt 159551



Score = 262 bits (132), Expect = 5e-67
Identities = 132/132 (100%)
Strand = Plus / Plus

Query: 325    gtgttgctacagctggaagatataaattgcattaatttagattggatcaacggttcacgg 384
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 134614 gtgttgctacagctggaagatataaattgcattaatttagattggatcaacggttcacgg 134673

Query: 385    gaatacatccatgctgtaaacaatctccgtgttgttggtgctgaggtggcttattttatt 444
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 134674 gaatacatccatgctgtaaacaatctccgtgttgttggtgctgaggtggcttattttatt 134733

Query: 445    gatgttctcatg 456
              ||||||||||||
Sbjct: 134734 gatgttctcatg 134745



Score = 248 bits (125), Expect = 8e-63
Identities = 125/125 (100%)
Strand = Plus / Plus

Query: 202    caggagatcagtgcggttaattcttcaactatccaagcctcatattttggaacagacaag 261
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 133284 caggagatcagtgcggttaattcttcaactatccaagcctcatattttggaacagacaag 133343

Query: 262    atcacccgtatcaacatagctggatggaaaacagatggcaaatggcagagagacatgtgc 321
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
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Sbjct: 133344 atcacccgtatcaacatagctggatggaaaacagatggcaaatggcagagagacatgtgc 133403

Query: 322    aatgt 326
              |||||
Sbjct: 133404 aatgt 133408



Score = 246 bits (124), Expect = 3e-62
Identities = 124/124 (100%)
Strand = Plus / Plus

Query: 685    ggtgttggaaccattgatgcttgtggtcatcttgacttttacccaaatggagggaagcac 744
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 151399 ggtgttggaaccattgatgcttgtggtcatcttgacttttacccaaatggagggaagcac 151458

Query: 745    atgccaggatgtgaagacttaattacacctttactgaaatttaacttcaatgcttacaaa 804
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 151459 atgccaggatgtgaagacttaattacacctttactgaaatttaacttcaatgcttacaaa 151518

Query: 805    aaag 808
              ||||
Sbjct: 151519 aaag 151522



Score = 244 bits (123), Expect = 1e-61
Identities = 123/123 (100%)
Strand = Plus / Plus

Query: 565    gggttggacccagctgggccatttttccacaacactccaaaggaagtcaggctagacccc 624
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 151197 gggttggacccagctgggccatttttccacaacactccaaaggaagtcaggctagacccc 151256

Query: 625    tcggatgccaactttgttgacgttattcatacaaatgcagctcgcatcctctttgagctt 684
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 151257 tcggatgccaactttgttgacgttattcatacaaatgcagctcgcatcctctttgagctt 151316

Query: 685    ggt 687
              |||
Sbjct: 151317 ggt 151319



Score = 242 bits (122), Expect = 5e-61
Identities = 122/122 (100%)
Strand = Plus / Plus

Query: 807    agaaatggcttccttctttgactgtaaccatgcccgaagttatcaattttatgctgaaag 866
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 156280 agaaatggcttccttctttgactgtaaccatgcccgaagttatcaattttatgctgaaag 156339

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Query: 867    cattcttaatcctgatgcatttattgcttatccttgtagatcctacacatcttttaaagc 926
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 156340 cattcttaatcctgatgcatttattgcttatccttgtagatcctacacatcttttaaagc 156399

Query: 927    ag 928
              ||
Sbjct: 156400 ag 156401



Score = 220 bits (111), Expect = 2e-54
Identities = 111/111 (100%)
Strand = Plus / Plus

Query: 456    gaaaaaatttgaatattccccttctaaagtgcacttgattggccacagcttgggagcaca 515
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 145953 gaaaaaatttgaatattccccttctaaagtgcacttgattggccacagcttgggagcaca 146012

Query: 516    cctggctggggaagctgggtcaaggataccaggccttggaagaataactgg 566
              |||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 146013 cctggctggggaagctgggtcaaggataccaggccttggaagaataactgg 146063



Score = 206 bits (104), Expect = 3e-50
Identities = 110/112 (98%)
Strand = Plus / Plus

Query: 1061   gttggaggcacaaattgtctgttaaactcagtggaagcgaagtcactcaaggaactgtct 1120
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 162000 gttggaggcacaaattgtctgttaaactcagtggaagcgaagtcactcaaggaactgtct 162059

Query: 1121   ttcttcgtgtaggcggggcaattgggaaaactggggagtttgccattgtcag 1172
              |||||||||||||||||||| || ||||||||||||||||||||||||||||
Sbjct: 162060 ttcttcgtgtaggcggggcagttaggaaaactggggagtttgccattgtcag 162111
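
The match line and identity count for an alignment row can be recomputed from the two aligned sequences. The sketch below does this for the second row of the 110/112 alignment above. It assumes an ungapped row (no "-" characters), and the function name is illustrative rather than part of MEGABLAST.

```python
def match_line(query, sbjct):
    """Return a BLAST-style match line: '|' where bases agree, ' ' otherwise."""
    if len(query) != len(sbjct):
        raise ValueError("aligned rows must have equal length")
    return "".join("|" if q == s else " " for q, s in zip(query, sbjct))


# Second row of the alignment at query positions 1121-1172.
query = "ttcttcgtgtaggcggggcaattgggaaaactggggagtttgccattgtcag"
sbjct = "ttcttcgtgtaggcggggcagttaggaaaactggggagtttgccattgtcag"

line = match_line(query, sbjct)
identities = line.count("|")
print(line)
print(f"{identities}/{len(query)}")
```

The two mismatches in this row, together with the fully matching first row of 60 bases, account for the reported 110/112 identities.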



Score = 127 bits (64), Expect = 2e-26
Identities = 64/64 (100%)
Strand = Plus / Plus

Query: 1341   atctaccttctgtagccaagacattatgggacctaatattctccagaacctgaaaccatg 1400
              ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 167322 atctaccttctgtagccaagacattatgggacctaatattctccagaacctgaaaccatg 167381

Query: 1401   ctaa 1404
              ||||
Sbjct: 167382 ctaa 167385
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Score = 105 bits (53), Expect = 7e-20
Identities = 68/73 (93%)
Strand = Plus / Plus

Query: 254    cagacaagatcacccgtatcaacatagctggatggaaaacagatggcaaatggcagagag 313
              ||||||||||||||| ||||| |||| ||||||||||||||||||||| ||||||| |||
Sbjct: 131453 cagacaagatcacccatatcagcatacctggatggaaaacagatggcagatggcagggag 131512

Query: 314    acatgtgcaatgt 326
              |||||||||||||
Sbjct: 131513 acatgtgcaatgt 131525



Score = 99.6 bits (50), Expect = 5e-18
Identities = 50/50 (100%)
Strand = Plus / Plus

Query: 1      atgcttggaatttggattgttgcattcttgttctttggcacatcaagagg 50
              ||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 118245 atgcttggaatttggattgttgcattcttgttctttggcacatcaagagg 118294



Database: Homo_sapiens.latestgp.fa
Posted date: Feb 19, 2004 9:43 AM
Number of letters in database: 200,770,258,131
Number of sequences in database: 26,614

Lambda     K      H
  1.37   0.711   1.31

Gapped
Lambda     K      H
  1.37   0.711   1.31

Matrix: blastn matrix:1 -3
Gap Penalties: Existence: 0, Extension: 0
Number of Hits to DB: 0
length of query: 2810
length of database: 200,770,258,131
effective HSP length: 22
effective length of query: 1382
effective search space used: 0
T: 0
A: 0
X1: 0 ( 0.0 bits)
X2: 20 (39.6 bits)
S1: 12 (24.3 bits)
S2: 38 (75.8 bits)
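
The bit scores in this report can be related to the raw scores in parentheses through the standard Karlin-Altschul conversion, S' = (lambda * S - ln K) / ln 2. The sketch below uses the Lambda and K values as printed above. Because those are rounded, the results only approximate the reported bit scores (to within about one bit). The function names are illustrative, and the effective-length relation is the usual BLAST convention, assumed here rather than stated in the report.

```python
import math

LAMBDA = 1.37  # as printed in the report (rounded)
K = 0.711      # as printed in the report


def bit_score(raw_score, lam=LAMBDA, k=K):
    """Convert a raw alignment score to bits: (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def effective_query_length(query_length, hsp_length):
    """Query length reduced by the effective HSP length (assumed convention)."""
    return query_length - hsp_length


for raw in (171, 158, 64):
    print(raw, round(bit_score(raw), 1))

# 1404 query letters minus the effective HSP length of 22.
print(effective_query_length(1404, 22))
```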
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